This research project was carried out with the purpose
of  investigating  some  of  the  idiosyncratic  speech  characteris- 
tics which  permit  an  individual  to  be  identified  from  his  voice 
alone.     The  specific  objectives  of  this  study  were:     (1)  select 
and  examine  certain  temporal  speech  parameters,  with  reference 
to  their  speaker  identification  capabilities,    (2)  test  the 
speaker  identification  effectiveness  of  the  selected  parameters 
under stress and disguise conditions, and (3) examine the
effects of simulated field conditions on the speaker identifi-
cation capabilities of the selected temporal vectors.

From  all  the  possible  temporal  characteristics  which 
exist  within  the  speech  signal,   four  general  sets  of 
parameters  were  chosen.     These  temporal  vectors  included 
durational  analysis  of:      (1)  relative  energy  at  several 
levels  of  intensity,    (2)  voiced  and  voiceless  activity, 
(3) vowel/consonant ratios, and (4) specific words and
phrases. Each of these vectors comprised from
2 to 40 variables. These temporal vectors were extracted
from  speech  samples  generated  from  three  experiments. 

The  initial  experiment  was  a  laboratory-based  study. 
Forty  adult  males  read  a  standard  prose  passage  while 
being  recorded  in  an  "ideal"  laboratory  setting.  The 
results  of  this,  the  first,  experiment  demonstrated  the 
time-energy  distribution  (TED)   vector  as  the  most  effective 
of  the  selected  temporal  parameters.     The  voiced/voiceless 
speech time (WL), vowel/consonant duration ratio (V/C),
and  word  and  phrase  duration  (WPD)  vectors  followed  in 
descending  order  of  identification  effectiveness. 

The  second  experiment  in  this  research  also  was 
laboratory-based.     In  this  case,  the  subjects   (20  adult 
males) were recorded under conditions similar to those of
the  first  experiment.     However,  these  subjects  read  the 
passage  in  three  different  manners.     These  speaking  con- 
ditions were:      (1)   normal,    (2)   stress   (applied  via  electric 
shock),  and  (3)   free  disguise.     This  experiment  resulted  in 
the  same  vector  effectiveness  as  the  first  experiment. 
That  is,  application  of  the  TED  vector  yielded  the 
highest  levels  of  identification  and  the  WL,  V/C,  and 
WPD vectors followed in effectiveness. In addition, it was found
that stress and disguise speaking conditions do reduce the
identification power of the selected temporal vectors. It
should be noted that, while the disguise condition yielded
much lower scores than the normal condition, these scores were
higher than those reported in other similar studies.

In  the  third  study,  the  temporal  parameters  were  inves- 
tigated under  conditions  which  would  parallel  those  found 
in  the  forensic  model.     A  speaker  simulated  a  "crime"  over 
the  telephone  and  a  "suspect  pool"  was  created  by  recording 
subjects  in  a  simulated  interrogation  procedure.  The 
findings  in  this  study  demonstrated  that  the  vectors  were 
relatively  ineffectual  in  this  very  restrictive  situation. 
However, the TED and WL vectors did show some limited
potential, indicating that these vectors may, at some
later  date,  be  useful  in  a  speaker  identification  system 
suitable  for  the  forensic  world. 

In  general,  a  few  overall  conclusions  can  be  made 
based  on  the  findings  of  the  three  completed  studies. 

1.  Temporal  characteristics  found  within  the 
speech  signal  are  important  in  the  speaker 
identification  process. 

2.  Certain  temporal  characteristics  are  idiosyncratic 
of  an  individual's  speech  patterns. 

3. Stressful and disguised speaking conditions
reduce  the  levels  of  identification  exhibited 
by  these  selected  temporal  vectors. 

4.  The  temporal  parameters  examined  in  this 
research program are less affected than
frequency  parameters  when  a  speaker 
disguises  his  voice. 

5.  The  restrictive  condition  of  a  simulated  field 
situation  greatly  interferes  with  the  identifi- 
cation powers  of  these  temporal  vectors. 

6.  The  temporal  parameters  may  be  a  useful 
addition  to  an  established  speaker  identifi- 
cation system. 

CHAPTER I


INTRODUCTION 

In  general,  the  most  important  information  contained 
within  the  speech  signal,  produced  by  a  given  individual, 
is  the  linguistic  message.     However,  this  message  is  by  no 
means  the  only  information  transmitted  to  a  listener  via  the 
utterance.     Data  about  the  speaker's  general  emotional 
state,  educational  background,  geographic  origin,  and/or 
specific  identity  also  may  be  provided.    All  this  informa- 
tion is  important  and  warrants  investigation.     However,  it 
was  the  focus  of  this  research  to  examine  only  the  speaker 
identifying  information  which  is  conveyed  through  the 
speech  signal. 

The  identification  of  a  speaker  from  his/her  voice 
alone  routinely  occurs  under  a  variety  of  familiar  circum- 
stances; for  example,   in  telephone  conversations,  at  cock- 
tail parties,   from  radio  broadcasts,  etc.     In  some  situa- 
tions,  it  is  not  just  desirable  but  crucial  to  be  able  to 
extract  the  speaker's  identity  from  his  voice  alone.  For 
example,    "infallible"  speaker  identification  techniques 
must be available to the military before voice activated
weaponry  can  be  developed  and  utilized  with  safety.  In 
addition,  businesses,  banks,  security  companies  and 
similar  organizations  have  a  need  for  voice  activated  com- 
puters, electronic  devices,  and  machinery. 

Speaker  identification  is  of  special  interest  to  law 
enforcement  agencies.     Currently,  the  recording  of  con- 
versations by  criminals  and/or  suspects  is  common  prac- 
tice in  criminal  investigations.     In  addition,  these 
recorded  conversations  often  are  admissible  in  courts  of 
law.     Therefore,  a  reliable  and  objective  speaker 
identification  system  would  be  of  inestimable  value  in  the 
identification  and  conviction  of  criminals. 

A  substantial  amount  of  research  has  been  carried 
out  in  response  to  the  need  for  speaker  identification 
techniques that are reliable and objective. This research
may be classified into three categories: (1)
aural/perceptual speaker recognition, (2) visual recognition
(spectrogram matching), and (3) machine recognition.
The aural/perceptual "method" is simply speaker
recognition  by  listening.     That  is,  the  technique 
utilizes  the  abilities  of  the  human  auditory  system  and 
the cognitive powers of the human brain to determine
the identity of a speaker. Research examining aural/
perceptual speaker recognition has shown it to be quite
valid under some circumstances (e.g., when the talker
is well known to the listener), but severe limitations
also  have  been  demonstrated.    A  review  and  discussion  of 
the  literature  appropriate  to  aural/perceptual  speaker 
identification  will  be  found  below. 

The  so-called  "spectrographic  method"  of  speaker 
identification  uses  the  aural/perceptual  approach  com- 
bined with  a  visual  pattern  matching  technique  based 
upon  frequency-by-time-by-intensity  sound  spectrograms 
("voice-prints").     These  spectrograms  are  compared  by 
means  of  a  pattern  matching  procedure.    A  review  of  the 
relevant spectrographic speaker identification literature
also will be presented below. However, controversy
exists concerning the predictive value of the "voice-print"
research; reviews of this conflict may be found
in the following: Black et al., 1973; Bolt et al.,
1970, 1973; Hollien, 1974, 1977; and Hollien and McGlone,
1976.

The third general approach currently being employed
in speaker identification tasks utilizes sophisticated
electronic devices (other than the sound spectrograph).
In  reality  there  are  many  "machine"  approaches;  however, 
all  appear  to  exhibit  four  important  advantages:  (1) 
the  specific  acoustic  and/or  temporal  parameters  to  be 
employed  may  be  extracted  from  the  speech  signal  serially 
and/or  simultaneously,    (2)   the  parameters   (or  group  of 
parameters)   utilized  may  be  used  in  various  combinations, 
(3)   the  subjectivity  of  human  analysis  is  eliminated  to 
a  great  degree   (Hollien  and  Majewski,  1977) ,  and  (4) 
the  analysis  can  be  done  to  any  level  of  desired 
accuracy.     In  sum,  the  machine  approach  appears  to  show 
the  greatest  potential  for  ultimately  producing  a  valid 
and  objective  speaker  identification  system. 

It should be pointed out that most research using the
machine approach utilizes parameters based on acoustic
analysis — particularly  frequency  spectrum  and  funda- 
mental frequency — and  the  importance  of  these  speaker 
characteristics  to  speaker  identification  has  been  demon- 
strated in  the  research  literature.     On  the  other  hand, 
there  are  various  temporal  parameters  that  can  be  extracted 
from the speech signal; they generally have not been
studied for their potential as speaker identification cues.
However, temporal speech parameters, such as vowel and
consonant duration, have been extensively investigated
in the area of speech perception. A review of
some of the relevant literature may be found in Lehiste
(1967).
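
As an illustration of what a temporal parameter extracted from the
speech signal might look like, the following sketch computes the
proportion of time a signal spends at each of several relative energy
levels. It is a minimal example only: the function name, the frame
length, the number of levels, and the -40 dB floor are all assumptions
made for illustration and are not the procedure used in this research.

```python
import math

def time_energy_distribution(samples, frame_len=4, n_levels=4, floor_db=-40.0):
    """Return the fraction of frames whose relative energy falls in each level.

    Levels are equal-width dB bands between floor_db and 0 dB, measured
    relative to the most energetic frame (a hypothetical choice).
    """
    # Short-time energy: mean squared amplitude per non-overlapping frame.
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    if not energies:
        raise ValueError("signal shorter than one frame")
    peak = max(energies)
    if peak == 0:
        raise ValueError("signal is silent")

    counts = [0] * n_levels
    band = -floor_db / n_levels  # width of each band in dB
    for e in energies:
        # Relative level in dB, clamped to the floor for very quiet frames.
        db = 10.0 * math.log10(e / peak) if e > 0 else floor_db
        db = max(db, floor_db)
        # Index 0 is the quietest band, n_levels - 1 the loudest.
        idx = min(int((db - floor_db) / band), n_levels - 1)
        counts[idx] += 1
    return [c / len(energies) for c in counts]

# Toy signal: one silent frame, two loud frames, one quiet frame.
signal = [0.0] * 4 + [1.0] * 4 + [0.05] * 4 + [1.0] * 4
vector = time_energy_distribution(signal)
```

Each element of the returned list is a duration expressed as a fraction
of the whole sample, so the elements sum to one and the vector can be
compared across speech samples of different lengths.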

In  summary,  speaker  recognition  has  been  examined  by 
three  approaches:     aural/perceptual  speaker  recognition; 
spectrographic or "voiceprint" speaker identification;
and automatic and semiautomatic recognition. Each of
these approaches has been examined in the research litera-
ture and some of their advantages and disadvantages ex-
plored. A review of the relevant literature follows.

Review  of  the  Literature 

Aural/Perceptual 
Speaker  Identification 

One  of  the  earliest  systematic  attempts  to  examine 
speaker identification by listening (aural) was reported by
McGehee (1937). In this study she first concealed auditors
behind a screen and had a single speaker read (to them) a
passage of 56 words. These auditors returned the second day
for a second set of trials. In this case they listened to
five speakers (including the original talker) reading the
same 56-word passage and they were asked to identify which
of the speakers was the original talker. McGehee reported a
correct identification rate of 83 per cent. The process
was repeated at two-week, three-month, and five-month inter-
vals. In these cases the scores for correct identification
were 68 per cent, 35 per cent, and 13 per cent, respectively.
On  this  basis,  McGehee  concluded  that  the  ability  to  identify 
speakers  aurally  deteriorated  as  a  function  of  time. 

In  a  later  study,  Pollack,  Pickett,  and  Sumby  (1954) 
investigated  certain  aspects  of  the  aural  speaker  identifi- 
cation task.     They  had  normal  male  talkers  of  similar  age, 
dialect,  and  rate  of  speaking  read  a  single  passage  both  in 
a  whispered  and  voiced  speaking  mode.     Comparisons  were  made 
between  the  identification  scores  of  these  two  speaking 
conditions. The authors reported that if similar correct
identification scores were to be obtained for both condi-
tions, the utterance duration for the whispered passage had
to be three times that of the voiced passage. On this
basis the authors suggested that duration plays a signifi-
cant role in the aural speaker identification process,
primarily because it allows larger samples of the speaker's
repertoire to be tested. Indeed, results for the voiced samples
showed that duration was important to identification only up to
1200 milliseconds; beyond that point no further improvement
in performance was noted. The authors also investigated the
effects of low-pass and high-pass filtering on listener
performance.     The  findings  from  this  portion  of  the  Pollack 
et  al.  study  demonstrated  that  correct  identification  is  not 
critically  dependent  upon  any  delicate  balance  of  frequency 
components,  in  any  single  portion  of  the  spectrum. 

Compton  (1963)   also  examined  the  effects  of  filtering 
and  duration  on  the  aural  identification  process.     He  used 
sustained  productions  of  the  vowel  /i/  while  varying  duration 
and  filtering  conditions.     He  reported  that  duration  is  a 
factor in listener performance; specifically, levels of
correct identification increased with lengthening of the
sample up to a duration of about 1250 milliseconds. This
finding is in close agreement with that reported by
Pollack et al. (1954). Compton also found that if the
speech sample was filtered above 1020
Hz, speaker identification rates were substantially reduced,
but filtering below 1020 Hz appeared to have no significant
effect on identification performance.

Bricker and Pruzansky (1966) investigated the
effects of both duration and content on speaker
identification by listening. Ten male talkers recorded five
types of speech samples: excerpted vowels, excerpted con-
sonant-vowel (CV) sequences, monosyllabic words, disyllabic
nonsense  words,  and  sentences.     Listeners  exhibited  the 
highest  correct  scores  for  samples  of  the  greatest  length 
(sentences)   and  identification  performance  decreased  as 
length  of  sample  decreased.     Moreover,  better  listener  per- 
formances were  obtained  with  CV  speech  samples  than  with  vowel 
excerpts  of  equal  duration.     The  experimenters  inferred  that 
the  number  of  phonemes  within  a  speech  sample  was  of  greater 
importance  to  identification  than  its  absolute  duration. 

In  the  second  part  of  their  experiment,  Bricker  and 
Pruzansky  utilized  the  vowels  /i/  and  /a/  as  experimental 
stimuli  to  study  the  effects  of  content  in  aural  identifica- 
tion.    Their  results  indicate  that  speaker  recognition  is 
not  independent  of  the  utterance.     They  also  found  that  a 
confusion  matrix  for  talker  identification  was  not  symmetri- 
cal.    That  is,  talker  A  may  be  confused  with  talker  B  but 
talker  B  may  not  be  confused  with  talker  A.     These  findings 
raise  some  interesting  questions  about  the  decision  criterion 
utilized by human listeners in determining a given speaker's
identity. 
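
The asymmetry described above can be made concrete with a small sketch.
The confusion counts below are invented purely for illustration (they
are not data from Bricker and Pruzansky), and the helper name
`asymmetric_pairs` is hypothetical; the example simply lists the talker
pairs for which confusions in one direction differ from those in the
other.

```python
def asymmetric_pairs(confusions, talkers):
    """List (a, b, a_as_b, b_as_a) for talker pairs whose confusion counts differ.

    confusions[actual][response] is the number of times talker `actual`
    was identified by listeners as talker `response`.
    """
    pairs = []
    for i, a in enumerate(talkers):
        for b in talkers[i + 1:]:
            a_as_b = confusions[a][b]
            b_as_a = confusions[b][a]
            if a_as_b != b_as_a:
                pairs.append((a, b, a_as_b, b_as_a))
    return pairs

# Hypothetical counts: talker A is often heard as B, but B is never heard as A.
talkers = ["A", "B", "C"]
confusions = {
    "A": {"A": 14, "B": 6, "C": 0},
    "B": {"A": 0, "B": 19, "C": 1},
    "C": {"A": 0, "B": 1, "C": 19},
}
result = asymmetric_pairs(confusions, talkers)
```

In this invented table only the A/B pair is reported, since B and C are
confused with each other equally often in both directions.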

As one aspect of a much larger study, Stevens et al.
(1968) investigated aural speaker identification using the
vowels /i/ and /a/ as speech stimuli. The talkers utilized
in this study were homogeneous with respect to their voices;
each read isolated words from which the experimental vowels
were extracted. Identifications based on the aural portion
of this study demonstrated that a word with the front vowel,
/i/, was a better cue to the identification of the speaker
than a word containing the back vowel /a/. The authors sug-
gested that the higher second formant of /i/ might have aided
in the improved listener performances. (The main purpose of
this study was to compare aural and visual methods of speaker
identification. The aural method produced higher identification
scores than the visual method in all portions of this study.)

Iles (1972) also found that speaker identification
scores varied when different vowels were used as speaker-
identifying features. This study involved six speakers and
16 listeners. Speech stimuli for the listeners were excerpted
from a passage read by the talkers; they consisted of several
sentences and four different vowels. Among other things, Iles
found that speaker differentiation cues were present to a
greater degree in the low vowels. She concluded that the
first formant may contain some idiosyncratic characteristics
which aid in speaker identification. Iles' results
are somewhat at variance with those of Stevens et al. (1968),
but the manner of presentation may have had a differential
effect on the results of the two studies.

Clarke  and  Becker   (1969)  used  a  rating  system  to  study 
aural  speaker  identification.     In  this  study,  each  listener 
rated  the  talker  on  six  different  scales:     pitch,  pitch 
variability,  rate,  click-like  elements,  sibilant  intensity 
and breathiness. The listeners used a seven-point scale for
pitch and five-point scales for the other variables. Identi-
fications were made utilizing each scale singly and then in
all possible permutations. Pitch was found to be the most
effective speaker discrimination characteristic when it was
used singly to determine a speaker's identity. Click-like
elements,   sibilant  intensity,  breathiness,  rate  and  pitch 
variability  followed  in  decreasing  order  of  effectiveness. 

LaRiviere   (1975)   studied  the  role  of  voice  source  in, 
and  the  effects  of  vocal  tract  transfer  characteristics  on, 
speaker identification. He used voiced, whispered, and low-
pass filtered vowels in order to examine (1) source
information  (filtered),    (2)   vocal-tract  information 
(whispered),  and  (3)  both  (voiced).     Using  these  three 
sample conditions, LaRiviere found that the whispered (vocal
tract) and low-pass (source) samples resulted in about equal
correct  identification  scores.     In  addition,  he  found  that 
the  summed  scores  for  the  whispering  and  low-pass  filtering 
conditions  were  about  equal  to  the  scores  for  the  voiced  full- 
band  condition.     The  author  concluded  that  both  voice  source 
and  vocal  tract  characteristics  were  of  equal  importance  to 
speaker  identification  and  seemed  to  be  communicating 
different  information  to  the  listener. 

A  study  examining  the  effects  of  stress  and  disguise  on 
perceptual  identification  of  talkers  was  carried  out  by 
Hollien,  Majewski,  and  Hollien  (1974) .     Adult  male  talkers 
read an extended prose passage under the following conditions:
(1) normal speech, (2) stress (talkers were subjected to
randomly distributed electric shock while speaking), and
(3) disguised speech. Three types of listeners were utilized:
(1) listeners who knew the talkers, (2) listeners who did not
know the talkers, and (3) listeners who knew neither the
talkers nor the language (i.e., native speakers of Polish
who did not know English). As would be expected, the results
of  this  study  revealed  that  the  group  who  knew  the  talkers 
did  best  under  all  speaking  conditions,  with  the  other  two 
groups  exhibiting  poorer  scores.     The  authors  concluded 
that exposure to a talker aids in the speaker identifica-
tion process. When the data were examined across speaking
conditions, rather than listener type, it was found that stress
(as utilized in this study) had little effect on the speaker
identification task. However, disguise greatly impaired the listener's
ability  to  identify  the  talkers.     The  same  speaking  condition 
trends  were  consistent  for  all  three  listener  types. 

Finally,  in  a  recent  study,  Rothman  (1977)  investigated 
the  effects  of  non-contemporary  speech  samples  on  aural  speaker 
identification.     Pairs  of  speakers  were  chosen  for  their  vocal 
similarities.     Some  were  father-son  combinations;  some  were 
brothers or twins; still others were simply sound-alikes.
These subject pairs all had a long history of being confused
with one another. Rothman recorded the talkers twice, at a
one-week interval. These recorded samples were presented to
listeners in two-second speech segments for each pair of
talkers. Same or different talker judgements were made by
the listeners under the following conditions: (1) same
talker/contemporary sample; (2) same talker/noncontemporary
sample; and (3) different talkers. The results were as follows: (1)
speakers  paired  with  their  own  contemporary  sample  were 
identified  94  per  cent  of  the  time;    (2)   speakers  paired 
with  their  own  noncontemporary  sample  were  identified  42 
per cent of the time; and (3) speakers paired with the
claimed vocal twin were correctly identified as themselves
58 per cent of the time. The fact that the percentage of correct
identifications is greater for group 3 than for group 2
(58% vs. 42%) would imply that listeners can detect some
idiosyncratic cues which aid in the determination of a
speaker's identity. The results also indicated that within the con-
straints of the population utilized, i.e., adult males chosen
for  similarity  of  their  voices,  time  appeared  to  have  played 
the  most  important  role  in  aural  speaker  identification.  It 
would,  therefore,  appear  that  even  when  recordings  are  made 
only  one  week  apart,  aural  identification  of  speakers  is 
greatly  impaired. 

To summarize the research on the aural/perceptual
approach to speaker identification, the following
relationships have been observed.

A.  Duration.     Some  evidence  suggests  that  speaker 
identification  may  be  largely  a  function  of  the 
absolute  duration  of  an  utterance.     More  recent 
studies,  however,   suggest  that  duration  is 
important  only  insofar  as  it  allows  listeners 
to  sample  a  larger  repertoire  of  the  talker's 
speech  behavior. 

B.  Fundamental  Frequency.     There  is  evidence  that 
the  fundamental  frequency  of  a  speaker  plays  an 
important  role  in  speaker  identification. 

C. Formant Frequency. The reported research suggests
that there is a relationship between formants
(especially the second formant) and speaker
identification.     However,  these  relationships  are 
not  consistent  over  individuals. 

D. Phoneme Effects. Speaker confusion appears to
vary with the phoneme; however, this confusion
also varies with specific vowels as well as voice
inflections and consonant-vowel sequence.

E.  Speaker  Conditions.     Speakers  have  the  ability  to 
disguise  their  voices  and  considerably  reduce 
correct  identification,  even  if  they  are  familiar 
to  the  listeners.     However,  there  is  some  evidence 
to  suggest  that  speakers  recorded  under  physical 
stress  are  not  much  more  difficult  to  identify  than 
normally  recorded  speakers. 

F. Contemporaneousness. There also is evidence to
indicate that if speech samples are noncontemporary,
aural/perceptual  speaker  identifications  are 
greatly  impaired. 

G. Familiarity. The research reports demonstrate that
exposure of a listener to a talker plays an impor-
tant role in aural speaker identification. Also,
evidence indicates that knowledge of the speaker's
language aids in the identification process.

Spectrographic  Speaker 
Identification 


A  second  method  by  which  speaker  identification  has 
been  investigated  utilizes  the  frequency-time-intensity 
sound  spectrograph.     The  sound  spectrograph  was  originally 
developed at the Bell Telephone Laboratories primarily for
the purposes of teaching the deaf to speak (see Potter
et al., 1966). However, in 1944, Gray and Kopp discussed
the  identification  of  speakers  by  visual  inspection  of 
spectrograms  and  concluded  that  this  method  appeared  to 
offer  good  potential  for  such  application. 

Later,  Kersta  (1962)  proposed  that  speech  spectrograms 
(voiceprints)  contained  cues  which  would  enable  observers 
to  identify  speakers  from  spectrograms  of  their  utterances 
alone. In his "investigation," talkers apparently produced
ten monosyllabic words in isolation, and spectrograms were made
of each word. The observers (young females) were given a
five-day training period in which they were taught to
identify speakers from their spectrograms (the exact nature
of this training was not specified). Upon completion of
their training, these observers were required to match
known spectrograms with unknown spectrograms. Kersta
reported that the results of this matching technique yielded
a 99 per cent correct identification rate.

Young  and  Campbell   (1967)    attempted  to  replicate  ele- 
ments of  the  Kersta  research.     They  used  the  same  ten 
monosyllabic  words  Kersta  used,  both  in  isolation  and  in 
context.     Sound  spectrograms  were  made  of  each  word  in  a 
fashion  identical  to  that  employed  by  Kersta.     The  observers 
utilized  in  this  study  also  went  through  a  training 
procedure. Then they were presented with spectrograms of
known and unknown speakers for identification. The results
indicate that words in isolation produce much higher correct
identification scores than words in context (78.4% in isola-
tion, 37.3% in context). These scores are also substantially
different  from  those  reported  by  Kersta.     Young  and  Campbell 
(1967)  concluded  that  one  reason  for  the  difference  between 
the  two  scores   (isolation  and  context)  may  be  the  duration 
of  the  word.     In  isolation,  the  words  tended  to  be  much 
longer  in  duration  than  when  put  within  the  context  of 
several  words.     The  authors  also  inferred  that  different 
contexts  change  the  speaker  identification  scores. 

Referring  back  to  the  Stevens  et  al.    (1968)   study  dis- 
cussed earlier,   it  may  be  observed  that  they  also  investi- 
gated the  spectrographic  speaker  identification  procedures 
by  means  of  open  and  closed  sets.     The  speech  samples  used 
were  monosyllables,  disyllables,  and  phrases.     In  the 
closed  set  tests,   the  correct  match  was  always  present, 
while  in  open  set  tests,   the  correct  match  might  or  might 
not  have  been  present.     The  closed  set  tests  yielded  a  mean 
identification  score  of  79  per  cent;  the  open  sets,  47  per 
cent  identification.     The  results  of  the  Stevens  et  al. 
study  appear  to  be  in  fair  agreement  with  Young  and 
Campbell (1967). However, these results are not in agree-
ment with the results reported by Kersta (1962).
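
The closed-set and open-set tasks differ only in whether "none of the
above" is a permissible response. Although the studies reviewed here
relied on human observers matching spectrograms, the distinction can be
sketched in machine terms. The example below is illustrative only: the
feature vectors, the Euclidean distance measure, the rejection
threshold, and the function name `identify` are all assumptions and do
not represent the procedure of any study cited.

```python
import math

def identify(unknown, references, reject_above=None):
    """Return the label of the reference vector nearest to `unknown`.

    With reject_above=None the task is closed-set: a label is always
    returned. With a threshold the task is open-set: None is returned
    when even the best match is farther away than the threshold.
    """
    best_label, best_dist = None, math.inf
    for label, vector in references.items():
        dist = math.dist(unknown, vector)  # Euclidean distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    if reject_above is not None and best_dist > reject_above:
        return None
    return best_label

# Hypothetical two-dimensional reference vectors for two known speakers.
references = {"speaker_1": (0.0, 0.0), "speaker_2": (3.0, 4.0)}

closed = identify((10.0, 10.0), references)                     # forced choice
opened = identify((10.0, 10.0), references, reject_above=2.0)   # may reject
```

In the closed-set call the distant sample is still assigned to the
nearest known speaker, whereas the open-set call rejects it. A forced
choice of this kind is one reason closed-set scores can be expected to
exceed open-set scores.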

In  1972,  Tosi  et  al.  reported  a  comprehensive  labora- 
tory study  that  examined  the  spectrographic  method  of  speaker 
identification.     These  authors  had  speakers  produce  words 
in  isolation,  in  fixed  context,  and  in  random  context.  The 
observers  for  this  study  were  given  a  month  of  training  in 
"phonetics  and  spectrogram  matching  procedures."     In  order 
to  proceed  to  the  more  difficult  tasks,  each  observer  had 
to  reach  an  identification  score  of  96  per  cent  correct  for 
closed  set  tasks.     The  more  difficult  experimental  trials 
consisted  of  open  sets,  words  in  context,  and  non-contemporary 
samples.     It  was  found  that  closed  sets  produced  lower  error 
rates  than  open  sets   (5.5%  vs.   9.9%),  and  that  contemporary 
samples  were  more  identifiable  than  non-contemporary  samples 
(4.8%  vs.  12.1%  errors).     In  addition,  the  error  rates  for 
words  in  isolation,   in  fixed  context  and  in  random  context 
were  4.2,  7.6,  and  13.4  per  cent,   respectively.     The  results 
of  this  study  indicate  that  visual  speaker  identification 
performance  is  best  for   (a)  words  in  isolation,    (b)  closed 
sets, and (c) contemporary speech samples. Tosi et al.
concluded from the data in this experiment that, with the
spectrographic method of speaker identification,
interspeaker variability is greater than, or different from,
intraspeaker variability. The authors also stated that
their results confirm the findings of Kersta (1962).

In  a  contrasting  study,  Hazen  (1973)   examined  speaker 
identification  in  open  and  closed  sets  under  differing  con- 
textual conditions.     For  speech  samples,  Hazen  utilized  60 
speakers  recorded  while  talking  spontaneously.     Four  cue 
words  were  then  isolated  from  the  speech  sample  and 
spectrograms made of each word. The matching or identifi-
cation procedure was done by subjects who were trained in
the same manner as those receiving Kersta's "Voiceprint
Identification Training Course." Results at all levels of
this experiment demonstrated higher error rates than those of
either Kersta (1962) or Tosi et al. (1972). Hazen concluded that
spectral  similarities  due  to  intra-speaker  consistency  were 
not  apparent  enough  to  outweigh  the  similarities  due  to  a 
different  phonetic  context. 

Hollien  and  McGlone   (1976)   examined  the  effects  of 
disguise  on  the  spectrographic  approach  to  speaker  identi- 
fication.    The  talkers  they  utilized  were  instructed  to 
read  an  extended  prose  passage  in  their  normal  voice  and 
then  repeat  the  reading  using  a  "disguised"  voice.  The 
auditors  in  this  experiment  consisted  of  faculty  and  a 
graduate student in phonetic sciences. All were well
acquainted  with  the  use  of  spectrograms  for  speaker  identi- 
fication purposes.     These  skilled  auditors  were  asked  to 
match  the  disguised  sample  spectrograms  with  a  spectrogram 
made  of  the  same  speaker  in  his  normal  voice.     The  average 
score  of  these  observers  was  25  per  cent  correct  identifi- 
cation.    The  authors  concluded  that  the  disguised  condition 
greatly  affected  spectrographic  speaker  identification.  In 
a later experiment, McGlone, Hollien, and Hollien (1977)
demonstrated possible reasons why the spectrographic
method of speaker identification is not able to recognize
an individual who is disguising his speech. Specifically,
these authors found variations in speaking fundamental
frequency, formant frequencies, and formant bandwidths.
They also found that the durations of speech samples were
generally  greater  for  the  disguised  voice  than  for  the 
normal  voice.     These  results  imply  that  the  speech  of  a 
talker  can  be  altered  significantly.     They  concluded  that 
until  these  acoustic  alterations  can  be  generalized  for 
the  whole  population  and  their  effects  predicted,  the 
spectrographic  approach  to  speaker  identification  would 
appear  to  be  inadequate  for  use  as  a  practical  speaker  iden- 
tification system. 

Reich, Moll, and Curtis (1976) designed an experiment
to investigate the effects of vocal disguise upon spectrographic
speaker identification. Forty male speakers provided
speech samples on two separate occasions, one week
apart. The samples consisted of several words excerpted
from different sentences. This sample arrangement was analogous
to the noncontemporary, random context of Tosi et al.
(1972). The speaking modes utilized were: (1) normal,
(2)  old-age  disguise,    (3)   hoarse  disguise,    (4)  hypernasal 
disguise,    (5)   slow-rate  disguise,  and  (6)   free  disguise. 
The  examiners  were  Ph.D.  candidates  in  the  speech  sciences 
program  at  the  University  of  Iowa  and  were  extensively 
trained  in  the  use  of  spectrograms  for  speaker  identifi- 
cation.    Matching  the  undisguised  test  spectrograms  to  the 
undisguised  reference  spectrograms,  the  examiners  correctly 
identified  56.7  per  cent  of  the  talkers.     When  the  dis- 
guised spectrograms   (all  types)  were  matched  to  the  undis- 
guised reference  samples,  only  33  per  cent  of  the  speakers 
were  correctly  identified.     The  authors  conclude  that: 
(1)  the  type  of  disguise  affects  the  degree  to  which 
spectrographic  speaker  identifications  can  be  made,  (2) 
certain  speakers  are  more  difficult  to  identify,  and  (3) 
the  findings  of  this  study  do  not  substantiate  the  prior 
claims that spectrographic speaker identification is
unaffected by attempts at vocal disguise.

Houlihan (1977) also examined spectrographic speaker
identification  of  disguised  voices.     She  carried  out  two 
related  experiments;  only  the  second  will  be  reviewed.  The 
speakers   (eight  females  and  eight  males)   in  this  experi- 
ment each  produced  sentences  in  five  speaking  modes:  un- 
disguised,  lowered  fundamental  frequency,  falsetto, 
whispered,  and  muffled.     The  mean  identification  score  for 
the  undisguised  condition  was  85.5  per  cent   (71  per  cent 
for females and 100 per cent for males). However, for the
disguised conditions, the overall identification
score was 27 per cent (28.5 per cent for females and
25 per cent for males). Houlihan's results demonstrate
that  female  speakers  are  not  more  difficult  to  identify  than 
male  speakers  under  a  normal  or  undisguised  condition.  The 
results  of  this  experiment  also  confirm  earlier  studies 
that  the  disguised  voice  confounds  the  spectrographic  method 
of  speaker  identification. 

Finally,   in  the  second  part  of  a  study  previously 
reviewed,  Rothman  (1977)   investigated  the  effects  of 
similar  sounding  talkers  on  the  spectrographic  approach 
to  speaker  identification.     As  described  earlier,  two 
recorded speech samples (separated by one week) were produced
by twelve talkers. Spectrograms were made of these
samples  and  were  presented  to  examiners  for  identification. 
Rothman found that contemporary phrases produced identification
rates of 24 per cent while noncontemporary samples
were considerably lower, at about 6 per cent correct identification.
On the basis of these data, he suggests that within
the  constraints  of  the  study,  time  of  utterance  is  a  very 
important  factor  for  speaker  identification  purposes.  The 
author  also  concluded  that  the  aural/perceptual  approach  to 
speaker  identification  is  a  significantly  better  approach 
than  is  the  spectrographic  method. 

In summary, while some studies (notably those of
Kersta  and  Tosi)  show  high  identification  rates  utilizing 
the  spectrographic  method  for  speaker  recognition,  most 
of  the  reported  research  does  not  confirm  these  high 
scores. The bulk of the literature reviewed above demonstrates
that there are many factors which seriously affect
the  identifications  made  from  spectrograms.     Some  of  these 
factors  are:      (1)   the  effect  of  training  of  examiners; 
(2)  whether  or  not  the  speaker  is  altering  his/her  voice 
in  any  way;    (3)  the  phonemic  context  of  the  utterance; 
(4) whether or not the utterance to be identified is
recorded in a contextual setting or in isolation; (5)
utterance  duration;   (6)  whether  or  not  identification  trials 
are  in  open  or  closed  sets;  and  (7)  whether  or  not  the 
sample recordings were contemporary or noncontemporary.
As McGlone et al. (1977) have pointed out, it seems
apparent from the literature that the spectrographic representation
of the voice is easily and greatly altered by
numerous means. Therefore, a great deal of further investigation
is necessary before the spectrographic approach can
be  considered  a  valid  and  reliable  means  of  speaker  identi- 
fication. 

Automatic  or  Semi-Automatic 
Speaker  Identification 

A  number  of  "machine"  approaches  have  been  utilized 
in  the  study  of  the  speaker  identification  process. 
These  approaches  may  be  categorized  in  a  number  of  ways 
including   (1)  method  of  analysis,    (2)   the  statistical 
technique,  or   (3)   the  speech  features  studied.     For  the 
purposes  of  this  review,  the  automated  approaches  will  be 
divided  into  groups  on  the  basis  of  the  acoustic  features 
utilized to determine a speaker's idiosyncratic
speaking characteristics. The features or
parameters which have been most extensively investigated
are  spectral  analysis   (long-term  speech  spectra),  funda- 
mental frequency,  formant  frequencies,  and  in  a  few  cases, 
temporal  features.    A  review  of  the  more  relevant  litera- 
ture dealing  with  automatic  speaker  identification  follows. 

Spectral  Analysis 
(Long-term  Speech  Spectra) 

An  early  study  utilizing  spectral  analysis  was  reported 
by Pruzansky (1963). Spectral patterns were developed
from ten words, excerpted from context, spoken by ten
talkers (seven males and three females). The spectral
patterns  of  three  utterances  of  the  same  word  by  the  same 
talker  were  used  as  a  reference.     The  remaining  utterance 
of  the  word  was  used  as  the  test.     Product  moment  co- 
efficients of  correlation  were  calculated  between  the 
reference  pattern  and  the  test  pattern.     The  test  and 
reference  sample  patterns  which  were  most  highly  corre- 
lated were  identified  as  being  produced  by  the  same 
speaker.     Utilizing  this  correlation  method  of  identifi- 
cation, Pruzansky  correctly  classified  89  per  cent  of 
393  test  utterances.     She  concluded  that  spectral  dis- 
tinctiveness of  talkers  is  retained  in  long-term  spectra. 
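
The correlation procedure described above can be illustrated with a short Python sketch. It is offered only as an illustration under stated assumptions and is not Pruzansky's program: the function names (`pearson`, `average_pattern`, `identify_by_correlation`), the averaging of the three reference utterances into a single pattern, and all numeric values are hypothetical.

```python
import math

def pearson(x, y):
    """Product-moment correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_pattern(utterances):
    """Average several utterance patterns element by element (an assumption of this sketch)."""
    return [sum(values) / len(values) for values in zip(*utterances)]

def identify_by_correlation(test_pattern, references):
    """Return (talker, r) for the reference pattern most highly correlated with the test."""
    scores = {talker: pearson(test_pattern, ref) for talker, ref in references.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Hypothetical spectral patterns (arbitrary units), three reference utterances per talker.
references = {
    "talker_A": average_pattern([[1.0, 3.0, 2.0, 0.5], [1.2, 2.8, 2.1, 0.4], [0.9, 3.1, 1.9, 0.6]]),
    "talker_B": average_pattern([[2.0, 1.0, 0.5, 3.0], [2.1, 0.9, 0.6, 2.9], [1.9, 1.1, 0.4, 3.1]]),
}
talker, r = identify_by_correlation([1.1, 2.9, 2.0, 0.5], references)
print(talker, round(r, 3))
```

In this toy case the test pattern is attributed to the talker whose averaged reference has the same spectral shape; the absolute level of the pattern does not matter, since the correlation coefficient is insensitive to overall gain.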

Majewski and Hollien (1974) and Zalewski, Majewski, and
Hollien  (1975)  examined  the  usefulness  of  long-term  speech 
spectra  as  a  cue  for  speaker  identification.     Both  studies 
utilized  the  same  group  of  subjects   (50  Americans  and  50 
Poles). Each subject read a prose passage as a speech
sample,  from  which  power  spectral  information  was  extracted. 
Majewski  and  Hollien  (1974)  utilized  Euclidean  distances  to 
classify  the  speakers  and  Zalewski  et  al.    (1975)  applied 
cross-correlations  to  the  spectral  data  in  order  to  make 
the  identifications.     The  mean  error  rate  for  both  the 
Majewski and Hollien and the Zalewski et al. studies was about
4 per cent. In a third study, Doherty (1976) used the 50
American  speakers  and  the  same  feature  extraction  tech- 
niques as  the  two  previous  studies.     However,  he  applied 
discriminant  analysis  in  order  to  recognize  the  speaker 
patterns.     Utilizing  the  same  full  bandwidth  (80  Hz  to 
12.5 kHz), no errors in identification were found. Doherty
then limited the bandwidth of the spectral information
to the range from 315 Hz to 3.5 kHz and recalculated his
error rate. In this case it was 24 per cent. Doherty
concluded  that  even  though  error  rates  under  the  bandpass 
condition  were  encouraging,  the  24  per  cent  was  unaccept- 
able for  most  practical  applications.     It  should  also  be 
noted that, while all of these studies (Majewski and
Hollien, 1974; Zalewski et al., 1975; and Doherty, 1976)
used basically similar populations, each utilized a different
statistical technique. In other words, Doherty also
demonstrated that the selection of statistical technique
is of importance in the automatic approach to speaker
identification.
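
A minimal Python sketch of the distance-based classification, as used in principle by Majewski and Hollien (1974), is given here for illustration. The band levels, talker labels, and function names (`euclidean_distance`, `nearest_reference`, `error_rate`) are hypothetical and are not taken from any of the studies reviewed; only the decision rule (assign the test sample to the reference at the smallest Euclidean distance) follows the description above.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two long-term spectra given as lists of band levels."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_reference(test_spectrum, reference_spectra):
    """Return the talker whose reference spectrum lies closest to the test spectrum."""
    return min(
        reference_spectra,
        key=lambda talker: euclidean_distance(test_spectrum, reference_spectra[talker]),
    )

def error_rate(trials, reference_spectra):
    """Per cent of (true_talker, test_spectrum) trials assigned to the wrong talker."""
    errors = sum(
        1 for true_talker, spectrum in trials
        if nearest_reference(spectrum, reference_spectra) != true_talker
    )
    return 100.0 * errors / len(trials)

# Hypothetical long-term spectra: level (dB) in four frequency bands per talker.
reference_spectra = {
    "S1": [60.0, 55.0, 48.0, 40.0],
    "S2": [58.0, 57.0, 50.0, 35.0],
    "S3": [62.0, 50.0, 45.0, 42.0],
}
trials = [
    ("S1", [59.5, 55.5, 48.5, 39.0]),
    ("S2", [58.5, 56.0, 50.5, 36.0]),
    ("S3", [60.5, 54.0, 47.0, 41.0]),  # drifts toward S1 and is misclassified
]
print(error_rate(trials, reference_spectra))
```

Replacing `euclidean_distance` with a correlation measure or a discriminant function changes only the decision rule, which is the sense in which the studies above differ in statistical technique while sharing the same spectral features.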

In  order  to  test  the  resistance  of  spectral  analysis 
to  distorted  speech,  Hollien  and  Majewski   (1977)  studied 
25  talkers  who  produced  speech  under  psychologically 
stressful and disguise conditions. Power spectral information
was extracted for all talkers in both full band (80 Hz
to 17.5 kHz) and limited band (315 Hz to 3.5 kHz), and
Euclidean distances were calculated for the talkers' individual
spectral  data.     Under  the  stress  condition,   92  per  cent 
correct  identification  was  achieved  for  the  full  bandwidth 
and  68  per  cent  identification  with  the  limited  bandwidth. 
The  disguised  condition  yielded  considerably  lower  scores, 
20  per  cent  correct  with  full  bandwidth  and  10  per  cent 
identification  with  limited  bandwidth.     Doherty  and 
Hollien (1978) also examined speaker identification of
distorted  speech.     The  authors  employed  the  same  talkers, 
speech  samples  and  feature  extraction  technique  as  those 
employed in the previous study. However, in this case
discriminant  analysis  was  used  as  the  statistical  procedure. 
Results  of  this  study  were  72  per  cent  correct  identifi- 
cation for  the  stress  condition  and  24  per  cent  correct  for 
the  disguise.     The  results  of  both  studies  demonstrate  that 
stress  has  little  effect  on  the  recognition  of  a  speaker, 
while  disguised  voices  are  much  harder  to  identify  than 
undisguised  voices.     The  conclusion  that  may  be  made  is 
that  disguise,  not  stress,  greatly  alters  the  spectral 
characteristics of a talker's voice.

Fundamental  Frequency 

In  order  to  examine  another  element  within  the  fre- 
quency domain,  Atal  (1972)   investigated  pitch  contours 
(fundamental  frequency  variations)  as  a  cue  to  speaker 
identification.     Atal  formed  a  20-dimensional  vector  based 
on  the  pitch  data  of  ten  speakers  and  utilized  linear 
transfer functions to maximize the ratio of the interspeaker
to intraspeaker variations of these pitch contours.
From these data, reference and test vectors were
formed and Euclidean distances calculated between these
vectors. By matching the reference and test vectors with
the smallest distance, the author was able to achieve an
identification rate of 98 per cent. On the basis of
these  results,  Atal  concludes  that  the  pitch  contours  are 
useful  acoustic  features  in  a  speaker  recognition  system. 
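
The criterion that Atal sought to maximize, the ratio of interspeaker to intraspeaker variation, can be shown for a single measurement with the following Python sketch. It does not reproduce Atal's linear transformation, whose details are not given in this review; it computes only a simple variance ratio for one feature. The function name `variance_ratio` and the fundamental frequency values are hypothetical.

```python
import statistics

def variance_ratio(samples_by_speaker):
    """Ratio of between-speaker variance to mean within-speaker variance for one feature.

    samples_by_speaker maps each speaker to repeated measurements of the same feature.
    Population variances are used throughout (a choice made for this sketch).
    """
    speaker_means = [statistics.mean(values) for values in samples_by_speaker.values()]
    between = statistics.pvariance(speaker_means)
    within = statistics.mean(
        [statistics.pvariance(values) for values in samples_by_speaker.values()]
    )
    return between / within

# Hypothetical mean fundamental frequency (Hz) from three utterances per speaker.
pitch = {
    "A": [100, 102, 98],
    "B": [120, 121, 119],
    "C": [140, 138, 142],
}
print(variance_ratio(pitch))
```

A feature with a large ratio separates talkers well relative to its utterance-to-utterance scatter; a ratio near or below one indicates that a talker varies as much from sample to sample as talkers vary from one another.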

In  1972,  Wolf  used  several  classes  of  frequency  data 
(fundamental  frequency  and  features  of  vowel  and  consonant 
spectra)  as  speaker  identification  cues.     The  author  had 
21  male  speakers  record  six  sentences  whereupon  the  fre- 
quency information  was  extracted  from  the  recordings  and 
a  linear  classification  procedure  applied.     Wolf  did 
not  analyze  his  parameters  separately  but  did  state  that 
fundamental  frequency  was  a  very  useful  parameter  in  his 
identification  paradigm.     However,  he  also  states  that 
fundamental  frequency  is  perhaps  the  easiest  and  most 
obvious  acoustic  feature  to  modify  in  vocal  disguise.  In 
a  similar  study,  Sambur   (1975)  utilized  the  same  speech 
material  and  features  as  those  of  Wolf  (1972).  However, 
the  Sambur  recording  sessions  spanned  a  period  of  three 
and  one-half  years  and  included  eleven  talkers.  Average 
fundamental  frequency  was  not  actually  tested  in  this 
recognition  experiment.     However,  by  use  of  probability 
of  error  criterion,  Sambur  ranked  average  fundamental 
frequency  twelfth  among  38  possible  recognition  features. 
These calculations indicate that fundamental frequency
can possibly be a useful cue in recognizing an unknown
speaker.

In  referring  back  to  the  Doherty  (1976)  study,  it  can 
be  found  that  he  also  extracted  speaking  fundamental  fre- 
quency from  the  speech  of  his  50  speakers.     Using  only  a 
two  parameter  vector  to  specify  fundamental  frequency,  a 
30.2  per  cent  correct  identification  score  was  achieved. 
This  vector  was  also  used  in  conjunction  with  limited  band- 
width long-term  speech  spectra.     The  combination  of  the 
two  vectors  yielded  an  identification  score  of  97.7  per 
cent.    As  an  overall  conclusion,  Doherty  states  that  funda- 
mental frequency  does  contain  useful  idiosyncratic  data 
and  appears  to  be  independent  of  the  information  carried  in 
long-term  spectra. 

In  the  Doherty  and  Hollien  (1978)   study  reported 
previously,  the  authors  also  used  fundamental  frequency 
as  a  cue  to  identify  the  25  talkers.     In  this  case,  the 
individuals  who  produced  stressed  speech  samples  were 
correctly  recognized  30  per  cent  of  the  time  but  the  dis- 
guised samples  only  10  per  cent  of  the  time.     Since  results 
of  the  stress  condition  are  in  agreement  with  those  of 
Doherty  (1976),  it  may  be  concluded  that  stress,  of  the 
type  examined  in  these  experiments,  at  least,  does  not 
greatly alter speaking fundamental frequency. The poor
identification  scores  for  the  disguise  condition  would 
seem  to  indicate  that  speaking  fundamental  frequency  is 
greatly  changed  when  a  speaker  attempts  to  disguise  his 
voice.     This  relationship  confirms  the  statement  made  by 
Wolf  (1972)  about  disguised  voices. 

Formant  Frequency 

The  resonances  or  formant  frequencies  of  vowels  and 
consonants  have  also  been  investigated  as  cues  to  speaker 
identification.     In  a  study  described  earlier,  Wolf  (1972) 
extracted  selected  formants  from  his  21  male  talkers. 
Utilizing 17 parameters, of which six were formant frequencies,
100 per cent recognition was achieved. Wolf concluded
that vowel and consonant spectral information is useful
in the classification of speakers. In a study also
evaluated  earlier,  Sambur   (1975)  reports  four  measure- 
ments of  the  same  type.     They  were:      (1)   the  second  formant 
of  /n/,    (2)   the  third  formant  of  /u/,    (3)  the  second 
formant  of  /i/,  and   (4)   the  third  formant  of  /m/. 
Utilizing  these  four  measurements  and  one  temporal 
measurement,  only  one  identification  error  in  320  trials 
was  made.     These  results  lead  Sambur  to  conclude  that 
formant information was among the most useful in recognizing
a speaker from his voice alone.

Finally, Goldstein (1976) evaluated ten vowel formant
structures as speaker identifying features. Ten adult
males were recorded reading sentences containing key
words. The formant tracks were extracted from these key
words; 199 measurements were made and evaluated for
effectiveness in speaker identification. In an identification
experiment using two formant measurements, only 12
errors were made in 80 identifications. Furthermore,
Goldstein evaluated five formant measurements with a technique
called probability-of-error (described by Sambur,
1975). Utilizing this technique an error rate of 0.25 per
cent was calculated for the five measurements. Based on
the findings in this study, Goldstein concludes that, since
certain formant measures demonstrate large speaker
differences, the variations must be dependent more upon
the speaker's habits than on vocal-tract configuration.
In addition, she states that it is possible that first
and second formant measures contain more information than
just vocal-tract length information.

Temporal Parameters

Temporal  features  within  the  speech  signal  have  not 
been  examined  for  speaker  dependency  in  the  same  detail  as 
have  frequency  features.     However,  in  a  study  described 
earlier,  Pruzansky  (1963)  tested  the  effect  of  only  utilizing 
temporal  data  on  speaker  recognition  success.     In  brief,  she 
used  ten  talkers  who  recorded  several  repetitions  of  ten 
words  in  context.     Two-dimensional  patterns,  consisting  of 
the  total  energy  in  time  segments,  were  formed  by  summing  the 
energy  over  the  several  frequency  bands  for  each  time 
section.    With  a  pattern  matching  technique,  the  temporal 
patterns  yielded  a  47  per  cent  correct  identification  score. 
Pruzansky  concluded  from  her  results  that  the  temporal 
patterns  were  more  correlated  to  the  individual  words  than 
to the speakers.
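
The reduction of a time-by-frequency representation to a purely temporal pattern, as described above, amounts to summing the energy across frequency bands within each time section. A brief Python sketch follows; the function name `temporal_pattern` and the sample values are hypothetical, and the pattern-matching step itself is not shown.

```python
def temporal_pattern(time_frequency_energy):
    """Collapse a time-by-frequency energy matrix to total energy per time section.

    time_frequency_energy is a list of time sections, each a list of band energies.
    """
    return [sum(bands) for bands in time_frequency_energy]

# Hypothetical energies: three time sections, three frequency bands each.
word = [
    [1, 2, 3],
    [0, 1, 0],
    [4, 4, 2],
]
print(temporal_pattern(word))
```

Because all spectral detail is discarded, two talkers producing the same word yield patterns dominated by the syllabic structure of that word, which is consistent with Pruzansky's conclusion.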

In  the  Wolf   (1972),  Sambur   (1975),  and  Goldstein  (1976) 
studies,   limited  temporal  measurements  were  evaluated  for 
possible  speaker  dependent  characteristics.     Wolf  (1972) 
measured  the  total  duration  of  the  word  "bought."    He  states 
that  certain  learned  idiosyncratic  voice  characteristics 
possibly  deal  mainly  in  timing.     However,  his  single  example 
of  a  gross  timing  measure  did  not  provide  useful  identification 
information, that is, in comparison to his spectral measurements.
Goldstein reports similar results for her timing
measurements.     She  investigated  the  duration  of  several 
formant  tracks  and  their  ratios.    None  of  the  temporal 
parameters  were  included  in  the  final  identification  analysis 
because  of  low  inter-speaker  variations.     In  contrast,  Sambur 
found  temporal  measures  useful  in  speaker  identification. 
He  made  two  measurements:     (1)  the  slope  of  the  second 
formant  of  the  diphthong  /ai/  and  (2)  the  duration  of  the 
frication and aspiration noise of the plosive /k/ in "cash."
Both these features were among Sambur's 10 most effective
speaker  identification  cues. 

In  the  Doherty  (1976)  experiment  previously  discussed, 
two  temporal  parameters  were  examined  for  use  in  speaker 
recognition.     These  two  measurements  were  speaking  time 
and  phoneme  rate.    Utilizing  this  two-parameter  vector, 
only  six  of  the  total  50  speakers  were  correctly  identified. 
However,  when  the  temporal  vector  was  added  to  his  other 
vectors   (speaking  fundamental  frequency  and  long-term 
speech spectra), the identification rates increased from
8 per cent to 26 per cent. This finding would suggest that
the speaker dependent information within spectral features is
different from that contained in the temporal elements
of the speech signal. These two temporal features, speaking
time  and  phoneme  rate,  also  were  examined  under  stress  and 
disguise speech conditions (Doherty and Hollien, 1978).
The identification scores for this study were very low
(12 per cent for the stress and 16 per cent for the disguise).
In  addition,  when  these  temporal  features  were  combined 
along  with  the  spectral  vectors,  correct  identification 
rates  only  increased  from  4  per  cent  to  10  per  cent.  These 
results  suggest  there  is  not  enough  idiosyncratic  information 
within  these  temporal  measurements  to  justify  their  use 
alone.     However,  the  results  also  indicate  that  the  temporal 
elements  within  the  speech  signal  contain  some  identification 
information  that  may  be  insensitive  to  stress  or  disguise. 

In  summarizing  the  experimental  findings  examined  in 
the    preceding  section,  several  conclusions  can  be  drawn 
pertaining  to  the  automatic  or  semi-automatic  approach  to 
speaker  identification.     Initially,   it  can  be  seen  that  the 
spectral  components  of  the  speech  wave  contain  certain 
speaker-dependent features. It should be noted also that
when  studies  utilized  similar  spectral  measures,  similar 
results  were  obtained.     Thus,  there  appears  to  be  a  great 
deal  of  consistency  among  these  experiments.     Moreover,   it  was 
found  that  limited  portions  of  the  spectrum,  viz.,  fundamental 
frequency and formant frequencies, appeared to contain
idiosyncratic characteristics relevant to the speaker identification
task. A finding of special interest to this
research is that temporal features extracted from the speech
waveform were of value in speaker identification systems.
Finally, it would seem that "machine" approaches to speaker
identification exhibit the greatest potential.
There  are  two  specific  relationships  which  support  this 
conclusion. First, when all the research carried out on
speaker identification (aural, visual, and semi-automatic)
is compared, this approach has produced consistently the
highest  and  most  reliable  correct  identification  scores. 
Second,  the  semi-automatic  approach  has,   for  the  most  part, 
removed  subjective  judgements  from  the  identification  process. 

Objectives 

The primary aim of this research was to develop and
test a system of inter-speaker differentiation in order
to discover whether certain speaker dependent features
remain relatively invariant during both physical and
psychological modifications to the acoustic environment.
Specifically,  this  study  examined  selected  temporal  parameters 
which  exist  within  the  speech  waveform.     These  parameters 
were: (1) durational measurements of energy at several
levels  of  magnitude,    (2)  presence  of  vocal  activity,  (3) 
patterns  of  silence,  and  (4)  the  duration  of  several 
specific  words  and  phrases.     It  also  was  the  aim  of  this 
study  to  investigate  the  effects  of  several  speaking  condi- 
tions upon  the  identification  capabilities  of  the  selected 
temporal  parameters. 

The  specific  goals  of  this  research  may  be  stated  in 
the  form  of  several  questions.     They  are  as  follows: 

A.  Which  of  the  several  selected  temporal  elements, 
found within the speech signal, are invariant,
idiosyncratic  characteristics  of  an  individual's 
speaking  repertoire? 

1.  Can  these  temporal  measurements  be  made 
reliably? 

2.  Once  measured,  do  these  parameters  predict 
a  speaker's  identity? 

B.  Will  stressful  or  disguised  speaking  conditions 
significantly  reduce  the  speaker  identification 
capabilities  of  these  selected  temporal 
parameters? 

C.  What  changes  occur  in  the  identification  strength 
of  these  parameters   (or  set  of  parameters)  when 
the method of choosing a test speech sample is
altered? 


CHAPTER  II 


METHODS 

As  has  been  stated,  the  aim  of  this  research  is  to 
develop  a  speaker  identification  system  based  on  the 
analysis  of  certain  temporal  features.     In  general,  these 
temporal  measurements  include  durational  analysis  of  (1) 
relative  energy  at  several  levels  of  intensity,    (2)  voiced 
or  voiceless  activity,    (3)  patterns  of  silence,  and  (4) 
specific  words  and  phrases.     These  parameters  were  grouped 
into  several  vectors  consisting  of  from  2  to  40  measure- 
ments and  studied  under  a  variety  of  speaking  conditions. 
These experiments were classified as (1) laboratory-normal
(LN), (2) laboratory-distorted (LD), and (3) semi-field (SF).
Specifically, in the first experiment the temporal measurements
were tested under "ideal" laboratory conditions for
the purpose of establishing baseline data on the speaker-identifying
ability of the selected features. In the
second  (LD)   experiment,   the  vectors  were  applied  to  a 
speaker  identification  task  under  three  speaker  conditions: 
normal,  stress,  and  disguise.     This  experiment  was  designed 
to evaluate the effects of speaker distortion (voluntary and
involuntary) on the selected (temporal) vectors' identification
capabilities. The third and final experiment can
be described as a semi-field study (SF). In this experiment
recordings  were  made  of  a  "crime-related"  telephone  message 
and  several  "suspects."    An  attempt  then  was  made  to  identify 
the  "criminal  caller"  from  among  the  "suspects"  in  a  closed 
set  paradigm.     The  purpose  of  this  last  study  was  to  evaluate 
the  temporal  parameters  in  a  quasi-forensic  environment. 
A  detailed  description  of  the  temporal  vectors,  experiments, 
and  the  statistical  analyses  utilized  follows. 

Temporal  Parameters 

From  all  the  possible  temporal  characteristics  which 
may  exist  within  the  speech  signal,   four  general  sets  of 
vectors  have  been  chosen.     They  are:      (1)  time-energy 
distribution (TED), (2) voiced/voiceless speech time
(V/VL), (3) vowel/consonant duration ratios (V/C), and (4) word
and phrase durations (WPD). These vectors were chosen on
the  basis  of  two  factors:     (1)   their  potential  to  contain 
information  idiosyncratic  to  a  speaker   (based  on  previous 
research)     and  (2)  the  potential  of  obtaining  these 
parameters  from  the  speech  signal. 

Time-Energy Distribution (TED)

This  temporal  vector  is  based  on  a  group  of  time-by- 
energy  measurements,  none  of  which  have  been  studied 
previously  in  relation  to  speaker  identification.  In 
general  terms,  this  analysis  reflects  the  total  accumulated 
time  a  talker's  speech  intensity  remains  at  a  specific 
energy  level   (relative  to  his  peak  amplitude) .     It  also 
provides  indications  of  the  speaker's  speech  pattern  with 
respect  to  speech  bursts  and  pause  periods. 

For  the  purposes  of  this  research,  the  speech  bursts 
are  defined  as  the  portion  of  the  speech  energy  which  is 
above  a  given  energy  level  and  pause  periods  represent 
those  areas  between  the  speech  bursts  at  a  given  level  when 
the  speech  energy  falls  below  that  given  energy  level. 
Operationally,  the  TED  procedure  was  carried  out  utilizing 
a  resistor-capacitor  circuit  to  generate  an  energy  envelope 
of  the  speech  signal.     This  signal  then  was  digitized  by 
an  analogue  to  digital  conversion  on  a  Digital  Equipment 
Corporation  PDP  8i  minicomputer.     Figure  1  gives  a  block 
diagram  of  the  equipment  configuration.     The  digitized 
sequence was analyzed for duration relative to ten intensity
levels via programming written especially
for this analysis.

Fig. 1. Block diagram of the equipment utilized to
generate an energy envelope from the analogue
speech signal and extract the time-energy
distribution (TED) parameters.

That is, the digitized envelope was
partitioned,  functionally,  into  ten  linearly  equal  energy 
levels,  initiating  with  its  peak  amplitude.     Figure  2  gives 
a  graphic  representation  of  a  typical  energy  envelope  as 
it would be partitioned. The speech bursts and pause
periods  are  shown  in  this  figure.     The  number  of  speech 
bursts,  mean  and  standard  deviation  of  the  speech  bursts, 
and  the  standard  deviation  of  the  pause  periods  were  computed 
for  each  energy  level.     The  mean  pause  periods  and  number 
of  pauses  were  not  computed  as  they  are  direct  reciprocals 
of  the  speech  bursts.     The  total  number  of  features  measured 
in  this  vector  was  40  (see  Table  1) .     This  vector  was  used 
in  all  three  experiments . 
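
The partitioning and burst/pause bookkeeping described above can be
illustrated with a short sketch. It is not the original PDP-8i program,
which is not reproduced in this text. The function names, the
sample_period argument, the placement of the ten thresholds at the lower
edges of ten equal-width bands below the peak, and the use of the
population standard deviation are all assumptions made for illustration.

```python
from itertools import groupby
from statistics import mean, pstdev


def run_lengths(flags):
    """Return (burst_lengths, pause_lengths), in samples, for a boolean sequence.

    Pauses are counted only between bursts, matching the definition in the text.
    """
    runs = [(value, sum(1 for _ in group)) for value, group in groupby(flags)]
    # Leading and trailing below-threshold runs are not "between bursts".
    while runs and not runs[0][0]:
        runs.pop(0)
    while runs and not runs[-1][0]:
        runs.pop()
    bursts = [n for value, n in runs if value]
    pauses = [n for value, n in runs if not value]
    return bursts, pauses


def ted_vector(envelope, sample_period=1.0, levels=10):
    """Build a 4 x levels feature list from a digitized energy envelope."""
    peak = max(envelope)
    features = []
    for k in range(levels):
        # Assumed threshold: lower edge of the k-th equal-width band.
        threshold = peak * k / levels
        bursts, pauses = run_lengths([x > threshold for x in envelope])
        burst_t = [n * sample_period for n in bursts]
        pause_t = [n * sample_period for n in pauses]
        features.extend([
            len(burst_t),                         # number of bursts
            mean(burst_t) if burst_t else 0.0,    # mean burst duration
            pstdev(burst_t) if burst_t else 0.0,  # SD of burst durations
            pstdev(pause_t) if pause_t else 0.0,  # SD of pause durations
        ])
    return features
```

With ten levels and four measures per level, the returned list has 40
entries, which agrees with the parameter count given in Table 1.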

Voiced/Voiceless Speech
Time Vector (V/VL)

This vector is created by combining the durations of the
voiced and voiceless portions of the speech signal. The
first parameter was defined as the duration of articulated
speech during a given speech sample. The second term of
the V/VL vector consisted of the total duration of phonation
or vocal activity during a speech sample. This specific
vector previously has not been used in the speaker
identification task. However, both voiced and voiceless phonemes
have been studied by a number of researchers in speaker
identification (see, for example, LaRiviere, 1975 and
Coleman, 1973). Their research has indicated the
importance of both voiced and voiceless speech sounds to
the identification of talkers from their speech.

Fig. 2. Schematic representation of a typical energy
envelope as generated from the TED equipment
configuration.

Table 1. Parameters Measured and Investigated for Possible
Utilization as the Time-Energy Distribution (TED)
Vector; Speech Bursts Represent the Portion of
the Speech Energy Which Is Above a Given Relative
Intensity Level and Pause Periods Are Defined as
Those Areas Between the Speech Bursts at a Given
Energy Level Where the Energy Falls Below the
Given Level.

Parameters                                      Number of
(Measured at Each of Ten Energy Levels)         Parameters

A. Speech Bursts
   1. Number of bursts                              10
   2. Mean duration of each burst                   10
   3. Standard deviation of the duration
      of bursts                                     10
B. Pause Periods
   1. Standard deviation of the duration
      of pauses                                     10
C. Total Number of Parameters Available             40

Phonation  time  is  obtained  from  the  IASCP  Fundamental 
Frequency  Indicator  (FFI-8)   linked  to  a  PDP8i  minicomputer. 
FFI-8  is  a  digital  readout  fundamental  frequency  tracking 
device. It consists of a group of successive low-pass
filters with cutoffs at half-octave intervals, coupled with
high-speed switching circuits which are controlled by a
logic system. The FFI-8 produces a string of pulses which are
delivered  to  a  PDP8i  computer.     These  pulses  mark  the 
boundary  of  a  fundamental  period  from  a  complex  wave.  An 
electronic  clock  marks  the  time  from  pulse  to  pulse  and 
these  values  are  processed  digitally  to  yield  (along  with 
other  data)   the  geometric  mean  frequency  level  and  standard 
deviation  of  the  frequency  distribution.     While  FFI-8  is 
basically  designed  to  extract  fundamental  frequency  from  a 
complex  wave,  it  will  also  calculate  the  duration  of  the 
vocal  activity  (phonation  time) . 

The  second  phase  of  this  procedure  was  to  extract  total 
articulation  time.     Referring  back  to  the  TED  procedure, 
the summed duration of the speech bursts at energy level one
represents the total amount of articulated speech (see
Figure 2). Taken together, the total articulation time and
the phonation time also represent the voiceless speech
component of the sample. This two-dimensional vector was
utilized in all three experiments.
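
A minimal sketch of how the two terms of the voiced/voiceless vector
might be assembled is given here. It is illustrative only: the text does
not describe the FFI-8 timing logic in enough detail to reproduce it, so
the max_period cutoff used to decide which inter-pulse gaps count as
phonation, the function names, and the reading of voiceless time as
articulation time minus phonation time are all assumptions.

```python
def phonation_time(pulse_times, max_period=0.02):
    """Sum the inter-pulse intervals short enough to be one fundamental period.

    max_period is a hypothetical cutoff in the same units as pulse_times;
    longer gaps are treated as unvoiced or silent and are not counted.
    """
    intervals = (b - a for a, b in zip(pulse_times, pulse_times[1:]))
    return sum(t for t in intervals if 0 < t <= max_period)


def vvl_vector(level_one_bursts, pulse_times, max_period=0.02):
    """Return (total articulation time, phonation time)."""
    articulation = sum(level_one_bursts)  # summed level-one burst durations
    phonation = phonation_time(pulse_times, max_period)
    return articulation, phonation


def voiceless_time(vector):
    """Assumed reading: articulated time not accounted for by phonation."""
    articulation, phonation = vector
    return max(articulation - phonation, 0.0)
```

All durations must share one unit; the sketch works equally well with
seconds or milliseconds as long as max_period is given in that unit.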

Vowel/Consonant  Duration 
Ratio  (V/C) 

The V/C vector is made up of the ratios of the durations
of selected vowels to the durations of their consonantal
environments. For this procedure, a separate ratio was
calculated for each of the following words: "good," "not,"
"cannot," and "sort." A vector of this type previously has
not been utilized in speaker identification research.
However, many researchers have demonstrated the individual
importance of vowels and consonants in the identification of
talkers (Clarke and Becker, 1969; LaRiviere, 1975; and
Glenn and Kleiner, 1968).

The  procedure  for  the  extraction  of  the  V/C  vector 
utilized  time-by-frequency-by-intensity  speech  spectrograms, 
made  on  a  Voiceprint  Identification,  Inc.,  Model  700 
spectrograph. Speech spectrograms were made of the words
"good," "not," "cannot," and "sort." Hand measurements of
the durations of the vowels and their associated consonants
were made from the time-frequency-intensity spectrograms. These
measurements were spot checked for accuracy by an independent
observer. The V/C ratios were formed from the duration of the
vowel or vowels in a selected word and the duration of the
consonant or consonants in that same word. This process
allowed the formation of one ratio for each sampling of the
selected word. The V/C vector was utilized as a speaker
identification feature only in the first two experiments.
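
The ratio formation described above reduces to a single division per
word sample. The following sketch shows one way to express it; the
function name and the millisecond durations in the usage note are
hypothetical and are not measurements from this study.

```python
def vc_ratio(vowel_durations, consonant_durations):
    """Ratio of summed vowel duration to summed consonant duration in one word."""
    total_consonant = sum(consonant_durations)
    if total_consonant <= 0:
        raise ValueError("consonant duration must be positive")
    return sum(vowel_durations) / total_consonant


def vc_vector(word_measurements):
    """Map each word to its V/C ratio.

    word_measurements: {word: (vowel_durations, consonant_durations)}
    """
    return {word: vc_ratio(v, c) for word, (v, c) in word_measurements.items()}
```

For instance, a hypothetical sampling of "good" with a 120 ms vowel and
consonants of 50 ms and 70 ms would give a ratio of 1.0, while a word
with two vowels, such as "cannot," simply contributes both vowel
durations to the numerator.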


Word and Phrase
Durations (WPD)


The  WPD  vector  is  generated  from  measurements  of  the 
individual  duration  of  several  selected  words  and  phrases. 
The words chosen for use in the WPD vector were: "good,"
"not," "cannot," and "sort." In addition, the durations of
three selected phrases were computed. These phrases were
"they have," "they cannot," and "it is not." A WPD vector
of this type has been examined previously by Wolf (1972),
who investigated a number of acoustic parameters for use
in the identification of speakers. Specifically, one of
the parameters he used was the duration of the word "bought."
He indicated that this word duration (alone) did not provide
good speaker identification information but that such a
parameter may be useful.

In  order  to  obtain  the  WPD  vector,  the  cited  words  and 
phrases  were  processed  in  the  same  manner  as  were  the  words 
for the V/C vector. That is, speech spectrograms were made
and the durations calculated from hand measurements. This
technique yielded a vector made up of seven parameters. The
WPD vector was utilized only in the laboratory-normal and
laboratory-distorted speech experiments.

Multiple  Vectors 

In  addition  to  examining  the  speaker  identification 
capabilities  of  each  of  the  vectors  separately,  these  vectors 
were investigated in all possible combinations. Specifically,
the vector combinations were utilized in groups of
two vectors (TED V/VL, TED V/C, TED WPD, V/VL V/C, V/VL WPD,
and V/C WPD), three vectors (TED V/VL V/C, TED V/VL WPD,
TED V/C WPD, and V/VL V/C WPD), and all four vectors
(TED V/VL V/C WPD). The technique of combining several
parameter sets has been shown to improve speaker
identification systems (Doherty, 1976; Sambur, 1973; and Goldstein,
1976).
This multiple vector approach provides the most efficient
use of these temporal vectors in a speaker identification
system. For example, if two vectors contain the same or
similar information about a speaker's voice then the combining
of these vectors would not produce improved identification
scores. Conversely, if two vectors vary independently of
one another, then their combination should produce better
identification scores. Therefore, this procedure
demonstrated which vectors are contributing new information and
which vectors contain duplicate information.
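
The eleven multi-vector groupings listed above (six pairs, four triples,
and the single four-vector set) can be enumerated mechanically. The
sketch below does so with the standard library; the tuple of vector
labels and the function name are illustrative choices, not part of the
original analysis software.

```python
from itertools import combinations

VECTORS = ("TED", "V/VL", "V/C", "WPD")


def vector_combinations(vectors=VECTORS):
    """Return every grouping of two or more vectors, smallest groups first."""
    return [
        combo
        for size in range(2, len(vectors) + 1)
        for combo in combinations(vectors, size)
    ]
```

Feature sets for a grouping would then be formed by concatenating the
parameters of its member vectors before classification.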

Experiment  One 

Laboratory,  Normal  (LN) 

This  initial  experiment  was  a  laboratory-based  study. 
The  subjects  read  speech  material  in  a  laboratory  environ- 
ment and  were  recorded  with  "ideal"  laboratory  equipment. 
The  purpose  of  this  experiment  was  to  develop  baseline  data  on 
the  speaker  identification  capabilities  of  the  temporal 
vectors.     A  detailed  description  of  the  subjects,  speech 
material,  and  recording  procedure  will  follow. 

Subjects

Forty  adult  speakers  were  chosen  from  a  volunteer  pool 
of  faculty  and  students  at  the  University  of  Florida. 
Subject  selection  was  made  on  the  following  basis:      (a)  18 
to  40  years  of  age,    (b)   no  apparent  speech  defects,  and 
(c)  no  unusual  regional  or  foreign  dialects.     These  minimal 
requirements  yielded  a  relatively  homogeneous  population. 
Thus,  these  subjects  permitted  initial  baseline  testing 
of  the  temporal  parameters  and  presumably  reduced  any 
obvious  inter-speaker  variations. 

Speech  Material 

Subjects  read  a  modernization  of  "An  Apology  for 
Idlers"  by  R.  L.  Stevenson,  an  approach  which  permitted  the 
speech  samples  to  be  context  independent.     The  passage  was 
chosen  because  it  is  relatively  long   (approximately  600 
words)   and  contains  most  phonemes  of  the  English  language. 
Therefore,   it  provided  a  good  representation  of  the  sub- 
ject's speech  repertoire  and  allowed  the  sample  to  be 
divided  into  several  smaller  subunits  where  necessary. 


Procedure 

The subjects utilized in the LN experiment were recorded
under laboratory conditions, in an IAC-1200 sound-treated
chamber with an Ampex Model No. 354 tape recorder coupled to
an Electro-Voice microphone Model No. EV 664. The recorded
speech samples were divided into four subsamples (30 seconds)
in order to permit the extraction of the TED and V/VL vectors.
Also, selected words and phrases were processed as described
in the V/C and WPD sections.

Experiment  Two 

Laboratory Distorted
Speech (LD)

As stated earlier, the purpose of the second experiment
was to investigate the effects of speaker distortions
(voluntary and involuntary) on the robustness of the temporal
vectors with respect to the speaker identification task. The
subjects were
recorded  under  the  same  laboratory  conditions  as  those  of 
the  first  experiment.     However,  this  experiment  involved 
three  speaking  conditions   (normal,  stress,  and  disguise). 
This  experiment  provided  data  on  the  consequences  of  speaker 
distortion  on  this  speaker  identification  system.     A  complete 
description  of  the  procedures  utilized  will  be  found  below. 

Subjects and Speech Material

The  speakers  were  20  male  faculty  and  graduate  students 
at  the  Institute  for  Advanced  Study  of  the  Communication 
Processes  and  the  Department  of  Speech  at  the  University  of 
Florida.     These  subjects  were  normal  speakers  of  American 
English  ranging  in  age  from  approximately  25  to  45  years; 
they  exhibited  no  unusual  dialects,  speech  or  voice  disorders. 
The  speech  material  for  this  experiment  was  the  same  as  that 
used  in  the  first  experiment. 

Procedure 

The  subjects  were  recorded  using  the  same  equipment 
described  for  the  first  experiment.     However,  three  speaking 
conditions  were  included  in  the  recording  procedure.  They 
were:     (a)  normal  speech  (control),    (b)   stress,  and  (c) 
disguise.    Emotional  stress  can  be  defined  in  a  number  of 
ways;  in  this  case,   it  was  induced  by  applying  electric 
shock,  delivered  randomly,  while  the  subject  was  speaking. 
For  the  third  condition,   subjects  were  requested  to  disguise 
their  speech  as  completely  as  they  could.     The  only 
restriction  placed  on  them  was  that  they  could  not  use  a 
"foreign  dialect"  or  whisper;  in  addition,   they  were 
encouraged  to  use  only  the  modal  voice  register. 
The recorded speech samples were divided into four
subsamples (30 seconds) for TED and V/VL analysis. For the
V/C  and  WPD  vectors  the  selected  words   ("good,"  "not," 
"cannot,"  and  "sort")  and  phrases   ("they  have,"  "they  cannot," 
and  "it  is  not")  were  extracted  as  specified  above. 

Experiment  Three 
Semi-Field  Conditions  (SF) 

The  third  and  final  experiment  had  as  its  purpose  the 
testing  of  the  identification  capabilities  of  the  temporal 
vectors  under  less  than  laboratory  conditions.     In  this  case, 
subjects  simulated  a  "crime"  over  a  telephone.     Later,  suspect 
interrogation  procedures  were  carried  out.     It  was  hoped  that 
the  results  of  this  experiment  would  provide  information  as 
to  how  well  the  selected  temporal  vectors  performed  under 
conditions  which  more  closely  parallel  the  forensic  model. 

Subjects  and 
Speech  Material 

The  subjects  for  this  semi-field  experiment  were  12 
adult  volunteers  drawn  from  local  law  enforcement  agencies. 
From  this  pool  of  volunteers,  one  of  the  subjects  was 
selected  to  assume  the  role  of  the  criminal;  he  simulated 
a telephone-related "crime" (a kidnapper's demand call).
The  remaining  eleven  subjects  were  designated  as  "suspects." 
All  subjects   (the  11  suspects  and  the  caller)  were  recorded 
during  an  interview  in  which  they  recited  statements  made 
by  the  original  caller.     This  interrogation  procedure 
permitted  the  control  of  context  and  provided  a  closed  set 
approach. 

Procedure 

The  criminal  call  was  recorded  over  a  telephone  on  a 
reel-to-reel  tape  recorder  via  a  direct  line  hookup.     It  was 
made  from  a  telephone  in  a  relatively  quiet  environment — a 
procedure  that  provided  for  reasonably  high  quality  recordings 
of  this  type.     However,  the  interrogation  procedure  was 
recorded  in  quite  a  different  manner.     In  this  case,  all 
subjects   (suspects  and  caller)  were  recorded  in  a  large  and 
relatively noisy room. This procedure was followed in order
to model a typical forensic situation. Only the TED and V/VL
vectors were extracted from these recordings. The V/C and
WPD vectors could not be utilized because the particular
subject recordings did not provide sufficient repeated
words and phrases.

Statistical Analysis

Of  the  many  techniques  available,  discriminant  analysis 
was  chosen  as  the  statistical  approach  to  the  identification 
of the speakers. This particular technique was chosen over
several others because it has demonstrated higher speaker
identification results than have such other methods as
Euclidean distance analysis or cross-correlation (Doherty,
1976; Doherty and Hollien, 1978; and Zalewski et al., 1975).

Discriminant  analysis  computes  a  set  of  linear  functions, 
termed  discriminant  functions,  which  then  are  utilized  to 
classify  individual  samples  or  observations  into  one  of 
several  groups.     In  the  case  of  this  research,  the  dis- 
criminant functions  were  utilized  to  classify  an  unknown 
speaker's  test  sample  against  reference  sets  generated  on 
known speaker sets. The input data to this procedure consisted
of sets of samples, each of which contained values for all
the parameters. For each parameter, an F-statistic was
calculated and used to determine which parameters were the
most powerful as identification cues.
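
The text does not state which F-statistic the discriminant program
computed. As one plausible reading, the sketch below computes a
univariate one-way analysis-of-variance F ratio for each parameter
(between-speaker variance over within-speaker variance) and ranks the
parameters by it. The function names and the data layout are
assumptions, and a stepwise discriminant program would normally
recompute such values as parameters are entered.

```python
def f_statistic(groups):
    """One-way ANOVA F ratio for one parameter; groups holds one list per speaker."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    means = [sum(g) / len(g) for g in groups]
    grand = sum(sum(g) for g in groups) / n
    between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means)) / (k - 1)
    within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g) / (n - k)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within


def rank_parameters(samples_by_speaker):
    """Return parameter indices ordered from largest to smallest F ratio.

    samples_by_speaker: {speaker: [feature_vector, ...]}
    """
    speakers = list(samples_by_speaker)
    n_params = len(samples_by_speaker[speakers[0]][0])
    scores = [
        f_statistic([[vec[j] for vec in samples_by_speaker[s]] for s in speakers])
        for j in range(n_params)
    ]
    return sorted(range(n_params), key=lambda j: scores[j], reverse=True)
```

A parameter whose values differ widely between talkers but little
within a talker receives a large ratio and is ranked first.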

Three  classification  methods  were  utilized  within  the 
statistical technique. Initially, the known or reference set
consisted of all four of each talker's speech samples.
Then, each sample was reclassified, in turn, with respect to
the reference sets. This method was labeled posterior
classification and was used to "pretest" the speaker
identifying features investigated in this research. The
second method chosen was the jackknife approach. In this
procedure, each sample was eliminated (in turn) from the
reference set before computation of the discriminant
functions was carried out. In this case, classification was
made on the "removed" speech samples. A third method was
utilized to simulate the forensic model. The initial sample
for each talker was arbitrarily designated the test sample
and the reference set consisted of the remaining three
samples. This final method was employed in the
identification task.
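
The three classification schemes differ only in which samples form the
reference set, and that difference can be shown with a small sketch.
Note that a nearest-centroid rule is substituted here for discriminant
analysis purely to keep the example short; it is not the classifier used
in this research. The function names and data layout are likewise
assumptions.

```python
from math import dist


def centroid(samples):
    """Mean feature vector of a list of samples."""
    return [sum(col) / len(col) for col in zip(*samples)]


def classify(sample, references):
    """Return the speaker whose reference centroid lies closest to the sample."""
    cents = {spk: centroid(refs) for spk, refs in references.items()}
    return min(cents, key=lambda spk: dist(sample, cents[spk]))


def posterior_rate(data):
    """Every sample is in the reference set and is then reclassified."""
    hits = sum(classify(s, data) == spk for spk, ss in data.items() for s in ss)
    return hits / sum(len(ss) for ss in data.values())


def jackknife_rate(data):
    """Each sample is removed from the reference set before it is classified."""
    hits = total = 0
    for spk, samples in data.items():
        for i, s in enumerate(samples):
            refs = dict(data)
            refs[spk] = samples[:i] + samples[i + 1:]
            hits += classify(s, refs) == spk
            total += 1
    return hits / total


def identification_rate(data):
    """First sample of each talker is the test; the rest form the reference."""
    refs = {spk: ss[1:] for spk, ss in data.items()}
    hits = sum(classify(ss[0], refs) == spk for spk, ss in data.items())
    return hits / len(data)
```

Because the posterior scheme classifies samples that helped build the
reference sets, it tends to give the most optimistic score, with the
jackknife and held-out identification schemes giving progressively
stricter estimates.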


CHAPTER  III 


THE  RESULTS  AND  DISCUSSION  OF  THE 
LABORATORY-NORMAL  EXPERIMENT 

The  initial  experiment  had  two  major  purposes.  The 
first  was  to  examine  certain  temporal  characteristics  of  an 
individual's  speech  in  order  to  discover  if  they  permit 
him  to  be  identified  from  his  voice  alone;  a  second  ob- 
jective was  to  establish  baseline  speech  data.    As  described 
in  previous  sections,  adult  males  provided  speech  samples 
from  which  the  following  temporal  vectors  were  extracted: 
(1) a time-energy distribution vector, (2) a voiced/voiceless
speech time vector, (3) a vowel/consonant duration vector,
and (4) a vector resulting from the analysis of the
durations of several selected words and phrases. These
vectors were made up of a number of parameters; each was
tested  for  its  speaker  identification  capability  with 
respect  to  the  acoustically  controlled  environment  of  this 
laboratory  type  experiment.     A  description  of  the  obtained 
results  and  a  discussion  of  these  findings  follow. 

Results

The  pretest  and  identification  scores  which  were  ob- 
tained when  the  vectors  were  utilized  singly  may  be  found 
in  Table  2.     Three  types  of  speaker  classifications  were 
used  to  test  these  vectors;  two  of  which  were  pretests  and 
the  third  an  identification  task.     The  first  pretest  was 
labeled  "posterior  classification."     In  this  case,  all 
samples  were  utilized  in  the  reference  set  determination 
and  then  each  sample  was  classified  in  relationship  to  one 
of  the  reference  sets.     The  purpose  of  this  initial  pre- 
test was  to  examine  the  inter-sample  variability.     By  means 
of this first method, it was found that the time-energy
distribution (TED) vector could be utilized to recognize
the speakers correctly 100 per cent of the time; a
correct classification rate of 57.5 per cent was found
for the voiced/voiceless speech time (V/VL) vector. The
scores for the vowel/consonant duration ratio (V/C)
vector decreased to only 22.5 per cent correct, while the
word and phrase duration (WPD) vector yielded no correct
classifications at all.

Table 2. Pretest and Identification Scores for the
Laboratory-Normal Experiment Obtained by Utilizing the
Time-Energy Distribution (TED), Voiced/Voiceless Speech
Time (V/VL), Vowel/Consonant Duration Ratio (V/C),
and Word and Phrase Duration (WPD) Vectors on an
Individual Basis, N = 40.

             Pretest Classifications
Vectors      Posterior    Jackknifed    Identifications

TED           100.0%        82.5%          25.0%
V/VL           57.5         37.5           25.0
V/C            22.5          7.5            5.0
WPD             0.0          0.0            0.0

The second type of pretest used was a "jackknifing"
method.     That  is,  each  sample,  in  turn,  was  eliminated  from 
the  discriminant  computations  and  a  reference  set  formed 
from  the  remaining  samples.     Each  sample  then  was  classified 
according  to  the  particular  reference  set  developed  when 
it  was  removed.     This  pretesting  procedure  allowed  all 
of  a  given  individual's  speech  samples,  taken  in  a  complete 
set,  to  be  evaluated  for  their  identification  capabilities. 
When the jackknifing method of speaker selection was
utilized, consistently lower classification scores were recorded.
However, the same general trends exhibited by the posterior
pretesting method also were found when this second
procedure was employed. That is, a classification score of
82.5 per cent correct was attained by the TED vector.
The V/VL and the V/C vectors yielded scores of 37.5 per
cent and 7.5 per cent, respectively; the WPD again did
not result in any correct speaker selections.

Finally,  identifications  were  carried  out  in  the 
following  manner.     The  first  speech  sample  for  each  talker 
was  chosen  as  the  test  sample  and  the  remaining  samples 
were  utilized  as  the  reference  set.     The  first  sample  was 
chosen  as  the  test  because  it  is  believed  to  represent 
the  most  variable  portion  of  the  entire  speech  sample.  In 
the identifications, the TED and V/VL vectors both resulted
in a 25 per cent correct level of speaker selection.
Utilizing the V/C vector, only 5 per cent of the speakers
were correctly identified, and the WPD, as would be expected
from the pretests, yielded no correct identifications at all.

The  second  phase  of  this  experiment  consisted  of  testing 
the  identification  abilities  of  the  temporal  vectors  in  all 
possible combinations. Identification scores of the
two-vector combinations may be found in Table 3. As can be
seen from the table, the same general trends were found as
when the vectors were tested singly. Specifically, it should
be noted that the TED and V/VL combination yielded the
highest score in the identification task (40 per cent
correct). Also, it is apparent that the addition of V/C
and WPD to either of the other two vectors did not increase
the identification rates appreciably. The three- and
four-vector combinations may be found in Table 4. The speaker
recognition  scores  for  this  group  of  vector  combinations 
appear  to  have  resulted  in  essentially  diminishing  returns. 
That  is,  the  scores  did  not  change  substantially  from  one 
vector  grouping  to  another.     This  plateau  effect  can  be 
noted  especially  with  respect  to  the  identification  task. 
The combinations of TED, V/VL, and V/C and of TED, V/C,
and WPD both yielded 47.5 per cent correct identification
scores, whereas the TED, V/VL, and WPD combination
and the four-vector combination yielded identification
scores of 42.5 per cent and 45 per cent, respectively.

Table 3. Pretest and Identification Scores for the
Laboratory-Normal Experiment Obtained by Utilizing the TED, V/VL,
V/C, and WPD Vectors in All Possible Pairs, N = 40.

                 Pretest Classification
Vectors          Posterior    Jackknifed    Identifications

TED X V/VL        100.0%        80.0%          40.0%
TED X V/C          95.0         50.0           27.5
TED X WPD         100.0         82.5           25.0
V/VL X V/C         82.5         45.0           30.0
V/VL X WPD         57.5         37.5           25.0
V/C X WPD          37.5         17.5            5.0

Table 4. Pretest and Identification Scores for the
Laboratory-Normal Experiment Obtained by Utilizing the TED, V/VL,
V/C, and WPD Vectors in All Three- and Four-Vector
Combinations, N = 40.

                          Pretest Classification
Vectors                   Posterior    Jackknifed    Identifications

TED X V/VL X V/C           100.0%        72.5%          47.5%
TED X V/VL X WPD           100.0         80.0           42.5
TED X V/C X WPD            100.0         65.0           47.5
V/VL X V/C X WPD            87.5         47.5           35.0
TED X V/VL X V/C X WPD     100.0         72.5           45.0

Discussion 

Single  Vector  Effectiveness 

Based on the results of the laboratory-normal experiment,
it would appear possible to consider at least some
of the basic questions asked by this research. For example,
of the several selected temporal parameter groups, the
time-energy distribution (TED) vector apparently
contained the most idiosyncratic characteristics, at least
for these normal or "ideal" recording conditions. This
judgement is based on the fact that application of the
TED resulted in the highest correct classifications being
made. Utilizing correct speaker selection as the
judgement criterion, the voiced/voiceless speech-time (V/VL)
vector, vowel/consonant duration ratio (V/C) vector, and
the word and phrase duration (WPD) vector followed TED
in a decreasing order of effectiveness.

There are several possible explanations for this
ranking of the temporal vectors. The high identification
scores for the TED vector may well have been predicted
since  it  contained  the  largest  number  of  parameters  (40) 
and,  therefore,   it  possibly  contained  more  information 
about  the  talker's  speech  characteristics  than  did  any  of 
the  other  vectors  investigated  in  this  research.  In 
addition,  previous  research  has  demonstrated  the  possible 
effectiveness  of  a  time  energy  vector.     For  example,  in 
1963  Pruzansky  examined  a  time-energy  distribution  similar 
to  the  one  utilized  in  this  study.     She  found  that,  of  her 
ten  speakers,  about  half  were  correctly  identified.  Also, 
spectral  information  may  be  considered  to  be  a  frequency 
counterpart  of  the  TED  vector,  and  spectral  analysis  has 
been  shown  to  exhibit  a  number  of  speaker  dependent 
properties.     That  is,  while  it  does  not  necessarily  follow 
that  the  work  of  Majewski  and  Hollien   (1974) ,  Hollien  and 
Majewski   (1977),  Doherty  and  Hollien  (1978),  and  others 
would predict the relative success of the TED vector, their
results do imply that an attempt to utilize the
time-energy information may be warranted.

The  voiced/voiceless  speech-time  vector  (containing 
only  two  parameters)   also  demonstrated  identification 
scores  that  were  sometimes  equal  in  magnitude  to  those  of 
the TED vector. These identification levels possibly could
have been predicted on the basis of studies that examined
the  function  of  phonation  in  the  speaker  identification 
process  (see  for  example,  LaRiviere,  1975;  Atal,  1972; 
Wolf,   1972;  Sambur,   1975;  Doherty,   1976;  and  Doherty  and 
Hollien,  1978) .     The  results  of  these  investigations  seem 
to  support  the  supposition  that  the  level  of  phonation  or 
vocal  activity  plays  a  role  in  the  speaker  identification 
task. 

In direct contrast to the TED and V/VL vectors, the
application of the vowel/consonant duration ratio and the
word and phrase duration vectors resulted in very low
identification scores. These low scores may be due, in
part, to the method by which they were obtained.
Specifically, hand measurements made on sound spectrograms were
used, and such measurements lend themselves to imprecision. Also,
the number of samples and type of samples were severely
limited. Although the V/C and WPD vectors did not produce
high identification scores, they should not be discounted
totally in future studies as potential
speaker-identifying features. Indeed, the speaker
identification literature supports the importance of the
analysis of vowels and consonants in the identification task
(for example, see Bricker and Pruzansky, 1966; Stevens
et al., 1968; Illes, 1972; LaRiviere, 1974; Wolf, 1972;
Goldstein, 1976; and others).

Multiple  Vector 
Effectiveness 

The  results  of  the  paired  vectors  appear  to  exhibit 
the  same  general  trends  as  in  the  case  of  the  findings  of 
the  single  vectors.     However,  most  of  the  vector  pairs 
exhibited  at  least  slight  increases  in  their  identification 
levels  on  both  the  pretest  and  identification  tasks.  As 
would  be  expected   (from  examining  the  single  vector  scores) , 
the TED-V/VL vector combination resulted in the highest
identifications for any paired vectors. This finding
supports the assumption made earlier that the TED and V/VL
vectors contain more idiosyncratic (talker) characteristics
than do either of the remaining two vectors (i.e., V/C or
WPD). The high levels for the TED-V/VL combination also
suggest that these two vectors are, for the most part,
sampling different types of information. That is, if the
information contained in the vectors was redundant, there
should not have been an increase in the combined
identification score. On the other hand, if the information was
mutually exclusive, it would be expected that the combined
identification rates would be about equal to the sum of
the single vector identification levels. In the case
of the TED-V/VL pair, the individual scores were 25 per
cent correct identification each, but the score when they
were paired was 40 per cent correct. Thus, the paired
scores support the suggestion that these two vectors sample
different speaker-dependent characteristics. At this stage
in the research it is impossible to know exactly what
characteristics are being measured. However, some
speculation as to the composition of this speaker information is
possible. Since the TED vector is a temporal correlate of
the speech spectra, it is possible that some of the
information being measured by the TED vector is spectral in nature.
On the other hand, the V/VL vector deals with the level of
phonation. Therefore, this vector most likely contains
only fundamental frequency information.
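
The redundancy argument above amounts to comparing the paired score with
two reference points: the better single score (fully redundant
information) and the sum of the single scores (mutually exclusive
information). The sketch below works through that arithmetic for the
reported scores of 25, 25, and 40 per cent. The function names and the
"relative gain" index are illustrative conveniences and do not appear in
the original analysis.

```python
def combination_bounds(score_a, score_b):
    """Return (redundant, exclusive) reference scores in per cent.

    redundant: no gain over the better single vector.
    exclusive: the single scores simply add, capped at 100 per cent.
    """
    redundant = max(score_a, score_b)
    exclusive = min(score_a + score_b, 100.0)
    return redundant, exclusive


def relative_gain(observed, score_a, score_b):
    """Position of the observed paired score between the two reference points."""
    low, high = combination_bounds(score_a, score_b)
    if high == low:
        return 0.0
    return (observed - low) / (high - low)
```

For the two vectors discussed here, the reference points are 25 and 50
per cent, so the observed 40 per cent lies about three-fifths of the way
toward the additive case. With N = 40, these scores correspond to 10
talkers identified by each vector alone and 16 by the pair.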

Further  examination  of  the  paired  vectors  demonstrates 
that  when  either  the  V/C  or  WPD  vectors  are  added  to  the 
other  vectors   (or  to  each  other) ,  the  results  provide 
little  or  no  improvement  in  identification  level.  However, 
there  is  an  interesting  relationship  which  occurs  when  the 
V/C  vector  is  paired  with  either  the  TED  or  WL  vectors. 
When  the  TED-V/C  combination  was  utilized,  the  pretest 
scores  were  lower  than  for  the  TED  alone — thus,  the  overall 
performance of the TED vector was, in a sense, degraded.
This  finding  would  suggest  that  the  information  being 
sampled  by  V/C  was  enhancing  speaker  similarities  and  the 
procedures  utilized  in  this  research  were  insufficiently 
sensitive  to  separate  and  use  the  speaker-dependent  informa- 
tion.    On  the  other  hand,  the  pairing  of  the  WL  and  V/C 
vector  resulted  in  some  improvement  in  the  identification 
levels  of  both  the  pretest  and  identification  tasks.  There- 
fore, the  results  of  the  WL-V/C  vector  combination  suggest 
that  the  data  being  sampled  are  significantly  different  and 
thus  speaker  identification  is  enhanced. 

The  pretest  classification  and  identification  scores 
reached  a  peak  of  "effectiveness"  in  the  three-  and  four- 
vector  combinations.     However,  these  increases  were  not  of 
the  same  magnitude  as  those  achieved  when  vectors  were 
paired.    An  explanation  for  this  relationship  may  be  found 
in  the  parameter  selection  procedure.     For  example,  once 
the  TED  and  WL  vectors  have  been  included  in  a  vector 
combination,  most  of  the  parameters  available  for  identi- 
fication are  accounted  for.     Therefore,  addition  of  the 
remaining vectors contributes little in the way of new
parameters.     Hence,  very  little  improvement  in  speaker 
identification  levels  can  be  expected. 

Parameter Selection

As  stated  previously,  from  two  to  40  separate  parameters 
were  entered  into  the  discrimination  process.     However,  a 
statistical  selection  procedure  was  utilized  in  order  to 
maximize  the  correct  identification  scores  and  to  reduce 
the  number  of  parameters  needed  for  identification.     That  is, 
a  stepwise  statistical  method  of  parameter  inclusion  and 
exclusion  was  carried  out  during  the  discriminant  analysis. 
The  determination  of  predictiveness  for  each  parameter 
and/or  group  of  parameters  was  carried  out  utilizing  an 
F-statistic  with  a  computation  formula  that  may  be  found 
in  Forsythe  et  al.    (1973).     In  accordance  with  this  system 
of  parameter  selection,  only  the  most  predictive  measure- 
ments were  entered  into  the  discrimination  process.     In  this 
first  experiment,   the  40  parameters  of  the  time-energy 
distribution  vector  were  reduced  to  only  17  usable  parame- 
ters.    However,  both  of  the  WL  parameters,  voiced  speech 
time  and  voiceless  speech  time,  were  utilized.     Of  the  four 
vowel/consonant  ratios,  only  two  met  the  sample  require- 
ments of  discriminant  analysis,  V/C  ratio  of  "good"  and  the 
V/C ratio of "not." Of these two parameters only the
V/C ratio of "not" was of any functional value as a pre-
dictive element in the identification process. As a
single vector, the word and phrase duration (WPD) vector
yielded no usable parameters.

Further,   in  the  parameter  selection  process  of  the 
multiple  vectors,  some  variables  were  added  to  the  vector 
group while others, formerly included, were dropped. In
the  larger  vector  (TED) ,  some  of  the  parameters  utilized 
in  the  single  pattern  were  dropped  when  the  parameters  of 
the  WL  vector  were  included.     However,  the  inclusion  of  the 
V/C  and  WPD  vector  resulted  in  no  change  in  the  TED 
parameter  selection.     Generally,  only  one  of  the  V/C 
parameters  was  included  in  multiple  vectors  groups.  This 
relationship may have been caused by the type of
information  being  measured  in  the  various  parameters.  In 
this  case,  certain  parameters  of  the  vectors  duplicate  the 
data  measured  by  parameters  of  other  vectors.  Therefore, 
some  of  the  parameters  have  reduced  predictive  power  and 
are  dropped. 
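
The idea of screening parameters by an F-statistic can be sketched as follows. This is a simplified illustration only: it computes an ordinary one-way analysis-of-variance F ratio for each parameter separately and keeps those above a threshold. It is not the computation formula of Forsythe et al. (1973), and a true stepwise procedure would recompute each F value conditional on the parameters already entered. The parameter names, the toy measurements, and the threshold of 4.0 are all hypothetical.

```python
def f_ratio(groups):
    """One-way ANOVA F ratio for a single parameter.

    `groups` holds one list of measurements per speaker.
    Assumes at least two speakers and some within-speaker variation.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


def select_parameters(data, f_to_enter):
    """Return parameters whose F ratio reaches `f_to_enter`, best first."""
    scores = {name: f_ratio(groups) for name, groups in data.items()}
    kept = [name for name, f in scores.items() if f >= f_to_enter]
    kept.sort(key=lambda name: scores[name], reverse=True)
    return kept, scores


# Hypothetical measurements: three speakers, four samples each.
toy = {
    "p1": [[1.0, 1.2, 0.9, 1.1], [2.0, 2.1, 1.9, 2.2], [3.0, 3.1, 2.9, 3.0]],
    "p2": [[1.0, 2.0, 3.0, 2.0], [2.0, 1.0, 3.0, 2.0], [2.0, 3.0, 1.0, 2.0]],
}
kept, scores = select_parameters(toy, f_to_enter=4.0)
print(kept)
```

In the toy data, "p1" separates the speakers and is retained, whereas "p2" has identical speaker means and is dropped. This mirrors, in a much reduced form, how a parameter with little between-speaker variation fails to enter the discrimination process.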

Test  Sample  Selection 

As  should  be  readily  apparent  from  the  results  in 
Tables  2-4,  the  method  of  classification  plays  an  important 
role  in  determining  the  magnitude  of  the  pretest  or  identi- 
fication scores.     In  all  cases,  the  posterior  classification 
procedure facilitated the highest percentage of correct
subject  selection — and  there  appeared  to  be  two  explana- 
tions for  these  higher  scores.     First,   in  this  case  the  test 
sample  was  utilized  in  the  computation  of  the  reference 
discriminant  functions.     Consequently,  the  classification 
procedure  is  biased  toward  the  selection  of  the  correct 
reference  set.     Second,  by  utilizing  each  of  the  four 
samples   (in  turn)   as  test  samples,   four  opportunities  for 
recognizing  the  correct  reference  set  were  allowed.  There- 
fore, this  procedure  makes  the  chances  of  matching  the 
correct  test  and  reference  much  greater  than  if  only  one 
sample  was  utilized. 

The  jackknifed  classification  method  demonstrated  the 
next  highest  correct  classification  levels.    As  stated 
earlier,  this  method  did  not  utilize  the  test  sample  within 
the  reference  set  computations  and  this  difference  may 
account  for  the  decreased  scores,  relative  to  the  posterior 
classifications.     However,  the  jackknifed  classifications 
were  better  than  those  for  the  identification  task.  The 
most  obvious  explanation  for  this  finding  is  that  in  the 
jackknife  pretest  all  four  samples  were  used  as  tests, 
whereas  the  identification  task  utilized  only  one  of  the 
speech  samples  as  the  test. 
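
The difference between the posterior and jackknifed pretests can be sketched in code. The example below is a minimal illustration under stated assumptions: a nearest-centroid rule stands in for the discriminant functions actually used in this research, and the one-dimensional toy measurements for speakers "A" and "B" are invented. The only point illustrated is that the posterior procedure includes the test sample in the reference set, whereas the jackknifed procedure withholds it.

```python
def centroid(rows):
    """Mean vector of a list of equal-length measurement rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def nearest(sample, centroids):
    """Name of the speaker whose centroid is closest to `sample`."""
    def dist2(name):
        return sum((s - c) ** 2 for s, c in zip(sample, centroids[name]))
    return min(centroids, key=dist2)


def posterior_accuracy(samples):
    """Per cent correct when every test sample is also in its reference set."""
    cents = {spk: centroid(rows) for spk, rows in samples.items()}
    results = [nearest(row, cents) == spk
               for spk, rows in samples.items() for row in rows]
    return 100.0 * sum(results) / len(results)


def jackknifed_accuracy(samples):
    """Per cent correct when each test sample is withheld from its reference set."""
    results = []
    for spk, rows in samples.items():
        for i, row in enumerate(rows):
            cents = {s: centroid(r) for s, r in samples.items() if s != spk}
            cents[spk] = centroid(rows[:i] + rows[i + 1:])  # leave the test sample out
            results.append(nearest(row, cents) == spk)
    return 100.0 * sum(results) / len(results)


# Invented data: four samples per speaker, one atypical sample each.
toy_samples = {
    "A": [[0.0], [0.2], [0.1], [1.0]],
    "B": [[2.0], [2.2], [2.1], [1.2]],
}
print(posterior_accuracy(toy_samples), jackknifed_accuracy(toy_samples))
```

In this toy case the atypical samples are classified correctly only when they help to define their own reference centroid, so the posterior score exceeds the jackknifed score. This is the bias described above for the posterior procedure.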

The identification task was the most important test
of the vectors' speaker identification effectiveness. The pre-
test classifications serve the purpose of demonstrating
which  parameter  groups  have  the  greatest  potential  as 
speaker-identifiers.     However,  they  are  only  usable  in 
closed  set  tests   (when  the  criminal  is  known).  Therefore, 
the  identification  task  must  be  considered  the  core  test  of 
this  research.     This  approach  utilized  only  the  first  of 
the four speech samples as the test sample. Hence,
classifications  in  this  case  were  based  solely  on  the 
information  being  sampled  by  that  single  speech  sample. 
A  second  identification  task  was  investigated  utilizing  the 
fourth  speech  sample  as  the  test.     This  procedure  was 
carried  out  in  order  to  test  the  consistency  of  the  first 
identification  task.     The  vector  effectiveness  resulting 
from  the  second  identification  task  followed  more  closely 
the  trends  set  in  the  posterior  and  jackknifed  procedures. 
Specifically, the TED vector resulted in the highest identi-
fication score (52.0 per cent correct). The WL vector
followed with a considerably lower identification score
of 12.5 per cent correct and both the V/C and WPD vectors
attained the same low scores as in the first
identification task. This set of findings seems to
indicate that the identification effectiveness of a vector
is  not  consistent  throughout  the  four  speech  samples.  That 
is,  for  the  TED  vector,  the  first  sample  was  less  idio- 
syncratic of  a  given  speaker  and  therefore  less  recognizable 
than  was  the  fourth  speech  sample.     However,  for  the  WL 
vector,  the  first  sample  was  more  idiosyncratic  of  a 
particular speaker than was the fourth sample. In addi-
tion, while the identification tasks attempt to model more
closely  the  forensic  model,  this  fourth  sample  identifica- 
tion task  suggests  that  the  pretest  procedures  (especially 
the  jackknifed  method)  provide  a  better  indication  of  the 
relative  identification  capabilities  of  these  selected 
temporal  vectors.     Finally,  the  identification  task  demon- 
strates that  the  more  restrictive  the  test  sample  selection 
procedure  becomes,  the  more  difficult  the  identification 
of  the  correct  speaker  becomes.     Nevertheless,   it  also 
should  be  noted  that  the  type  of  classification  method 
utilized  to  make  the  speaker  discriminations  did  not  change 
the  vector  effectiveness  ranking  among  the  temporal  vectors. 

In  summary,  the  following  conclusions  may  be  stated 
with  respect  to  the  selected  temporal  vectors  and  their 
speaker  identification  capabilities. 

1. Under the constraints of this experiment, the
time-energy distribution and the voiced/voiceless
speech time vectors (and to a lesser degree
the vowel/consonant duration ratio) appear to
exhibit idiosyncratic speaker identifying
characteristics.

a.  The  V/C  and  WPD  vectors,  as  defined  and 
measured  in  this  study,  do  not  function 
well  as  cues  to  speaker  identification. 

2.  The  TED  and  WL  vectors  appear  to  contain  dis- 
tinctly different  speaker  related  information. 

3. The V/C vector enhances the performance of the
WL  vector  while  it  seems  to  degrade  the  speaker 
identifying  abilities  of  the  TED  vector. 


CHAPTER  IV 


THE RESULTS AND DISCUSSION OF THE
LABORATORY-DISTORTED SPEECH EXPERIMENT

The  second  experiment  undertaken  in  this  program  of 
research  was  carried  out  in  order  to  test  the  speaker  identi- 
fication capabilities  of  the  selected  temporal  parameters 
with  respect  to  speaker  distortions  of  a  specific  type. 
The  speakers   (20  adult  males)  were  recorded  under  the  same 
conditions  as  in  the  first  experiment.     However,  three 
experimental  speaking  situations  were  imposed  upon  the 
talkers.    As  detailed  earlier,  these  speaking  conditions 
were:     (1)   normal,    (2)   stress,  and   (3)  disguise.  The 
results  of  this  second  experiment  may  be  found  below. 

Results 

Normal  Speaking  Condition 

In  a  manner  similar  to  that  utilized  in  the  first 
experiment,   three  test  selection  procedures  were  used  to 
examine  the  identification  capabilities  of  the  selected 
temporal  measures.     Two  pretesting  methods,  posterior  and 
jackknifed classification, again were used to investigate
the  intra-  and  inter-sample  variability.     The  third  approach 
was  that  of  speaker  identification;  this  third  method  was 
utilized  to  simulate  the  forensic  model.     The  results  of 
these  three  experimental  procedures  may  be  found  in  Table  5. 
As  may  be  seen  from  these  results,  the  time-energy  distri- 
bution (TED)  vector,  when  applied  in  isolation,  produced 
the  highest  pretest  and  identification  scores.     This  vector 
correctly  classified  all  twenty  speakers  in  the  posterior 
procedure,  95  per  cent  of  the  speakers  in  the  jackknifed 
pretest,  and  60  per  cent  of  the  speakers  utilizing  the 
identification  task.    Following  in  vector  effectiveness 
(re. speaker's identity) was the voiced/voiceless speech
time (WL) vector. In this case, the correct speaker
classification  rates  were  65  per  cent  and  40  per  cent  in  the 
two  pretesting  methods  but  only  at  a  7.5  per  cent  level 
relative  to  the  identification  method.     Application  of  the 
vowel/consonant  duration  ratio  (V/C)  and  the  word  and 
phrase  duration  (WPD)   vectors  resulted  in  no  correct 
classifications  or  identifications  at  all. 

By further examination of Table 5, it may be seen that
the classification and identification scores of the combined
vectors reveal several relationships; that is, those that
could be predicted from the examination of the single vectors.
Moreover, it should be noted that the effects of the other
vectors when combined with the TED vector resulted in little
or no improvement in the overall pretest or identification
scores. This finding resulted from the fact that in the
posterior classification procedure, all the speakers were
correctly identified; therefore no improvement was possible.
In the case of the jackknifed and identification procedures,
no improvement was observed because few new parameters were
included into the vector groupings.

Table 5. Pretest and Identification Scores of the Laboratory-
Distorted Speech Experiment Obtained Utilizing the
Time-Energy Distribution (TED), Voiced/Voiceless
Speech Time (WL), Vowel/Consonant Duration Ratio
(V/C) and Word and Phrase Duration (WPD) Vectors
in the Normal Speaking Condition: N = 20; All
Scores Are in Percentages.

                            Pretest Classifications    Identifi-
Vectors                     Posterior    Jackknifed    cations

A. Single Vectors
   TED                        100.0         95.0         60.0
   WL                          65.0         40.0          7.5
   V/C                          0.0          0.0          0.0
   WPD                          0.0          0.0          0.0

B. Paired Vectors
   TED X WL                   100.0         95.0         55.0
   TED X V/C                  100.0         95.0         60.0
   TED X WPD                  100.0         95.0         60.0
   WL X V/C                    65.0         40.0         15.0
   WL X WPD                    80.0         40.0         20.0
   V/C X WPD                    0.0          0.0          0.0

C. Three-Vector Combinations
   TED X WL X V/C             100.0         95.0         55.0
   TED X WL X WPD             100.0         90.0         50.0
   TED X V/C X WPD            100.0         95.0         60.0
   WL X V/C X WPD              65.0         40.0         15.0

D. Four-Vector Combinations
   TED X WL X V/C X WPD       100.0         90.0         55.0

Stress  Speaking  Condition 

In the case of the speaking condition during which sub-
jects were stressed, the posterior classification procedure
was eliminated. This classification procedure was judged
inappropriate since the stressed speaking samples were com-
pared only with the normal samples. Therefore, only the
jackknifed pretest and identification tasks could be carried
out. The results of this set of tests may be found in Table 6.
Under this stress condition, the TED, again, yielded the
highest levels of speaker classification and identification
(70 per cent and 40 per cent correct). However, the WL
vector did almost as well as TED in the jackknifed method,
65 per cent. On the other hand, the identification rate
of the WL vector was 20 per cent correct; considerably
lower than that achieved by the TED. The remaining two
vectors, V/C and WPD, did not correctly classify any of the
talkers when these vectors were used in isolation.

Table 6. Pretest and Identification Scores of the Laboratory-
Distorted Speech Experiment Obtained Utilizing the
TED, WL, V/C, and WPD Vectors in the Stress Speak-
ing Condition: N = 20; All Scores Are in Per-
centages.

                            Pretest Classifications    Identifi-
Vectors                     Jackknifed                 cations

A. Single Vectors
   TED                         70.0                      40.0
   WL                          65.0                      20.0
   V/C                          0.0                       0.0
   WPD                          0.0                       0.0

B. Paired Vectors
   TED X WL                    65.0                      30.0
   TED X V/C                   70.0                      40.0
   TED X WPD                   70.0                      40.0
   WL X V/C                    65.0                      20.0
   WL X WPD                    35.0                      15.0
   V/C X WPD                    0.0                       0.0

C. Three-Vector Combinations
   TED X WL X V/C              65.0                      30.0
   TED X WL X WPD              65.0                      35.0
   TED X V/C X WPD             70.0                      40.0
   WL X V/C X WPD              35.0                      20.0

D. Four-Vector Combinations
   TED X WL X V/C X WPD        65.0                      30.0

The  multiple  vectors  produced  no  real  improvement  in 
the  classification  or  identification  of  the  stress  talkers 
over  those  of  the  single  vectors.     The  TED  X  WL  vector 
combination  achieved  the  same  jackknifed  score   (65  per  cent) 
as  that  of  the  WL  vector  and  only  a  slightly  better  identi- 
fication score   (30  per  cent) .     Combinations  involving  the 
V/C  and  WPD  vectors  resulted  in  no  improvement  and  in  some 
cases  their  effects  tended  to  degrade  the  level  of  correct 
identification.     For  example,  the  WL  X  WPD  vector  combina- 
tion produced  a  jackknifed  classification  of  35  per  cent 
correct  and  an  identification  rate  of  15  per  cent;  both  of 
these  scores  were  lower  than  those  of  the  WL  vector  utilized 
singly. In the case of the four-vector combination, the
jackknifed pretest resulted in only 65 per cent correct.
This score is 5 per cent lower than the score attained when
the TED vector was utilized in isolation.

Disguised Speaking
Condition 

The  results  of  the  disguise  condition  of  the  second 
experiment  may  be  seen  in  Table  7.     From  this  table  it  can 
be  seen  that  only  the  jackknifed  and  identification  procedures 
were  utilized.     The  posterior  classifications  could  not  be 
carried  out  because  all  classifications  of  this  speaking 
condition  were  done  by  comparing  the  disguised  samples  to  the 
normal  samples.     As  with  the  two  preceding  conditions,  the 
TED  vector  for  this  speaking  condition  resulted  in  the  highest 
set  of  scores.     In  the  jackknifed  pretest,  45  per  cent  of  the 
disguised  voices  were  correctly  matched  to  the  talkers  who 
produced  them;  correct  identifications  were  made  in  30  per 
cent  of  the  cases.     The  WL  vector  achieved  35  per  cent  cor- 
rect disguise  to  normal  matches  but  in  the  identification 
task  only  5  per  cent  of  the  voices  were  correctly  identified. 
Also,  as  was  seen  from  the  previous  speaking  conditions,  the 
V/C  and  WPD  vectors  were  unable  to  produce  any  correct 
classifications  or  identifications. 

Combining the vectors resulted in some classification
and identification improvement for these disguised conditions.
Specifically, the TED X WL vector combination correctly
classified the disguised voices 60 per cent of the time;
this combination also resulted in 40 per cent correct in the
identification task. All other vector combinations resulted
in scores which were representative of the vector with the
highest single vector score. For example, vector combina-
tions which contain the TED parameters (except TED X WL)
resulted in a set of scores which were identical with those
of the TED used singly. This same relationship is found
for combinations containing WL and TED X WL parameters.

Table 7. Pretest and Identification Scores of the Laboratory-
Distorted Speech Experiment Obtained Utilizing the
TED, WL, V/C, and WPD Vectors in the Disguised
Speaking Condition: N = 20; All Scores Are in
Percentages.

                            Pretest Classifications    Identifi-
Vectors                     Jackknifed                 cations

A. Single Vectors
   TED                         45.0                      30.0
   WL                          35.0                       5.0
   V/C                          0.0                       0.0
   WPD                          0.0                       0.0

B. Paired Vectors
   TED X WL                    60.0                      40.0
   TED X V/C                   45.0                      30.0
   TED X WPD                   45.0                      30.0
   WL X V/C                    35.0                       5.0
   WL X WPD                    35.0                      15.0
   V/C X WPD                    0.0                       0.0

C. Three-Vector Combinations
   TED X WL X V/C              60.0                      40.0
   TED X WL X WPD              60.0                      35.0
   TED X V/C X WPD             45.0                      30.0
   WL X V/C X WPD              35.0                       5.0

D. Four-Vector Combinations
   TED X WL X V/C X WPD        60.0                      40.0

Discussion 

Normal  Speaking 
Condition 

It  will  be  remembered  that  in  this  first  procedure  the 
classifications  are  made  by  comparing  the  normal  speaking 
samples  to  the  normals.     The  results  of  first  speaking  con- 
dition of  this  second  experiment  reaffirm  many  of  the  findings 
of  the  initial  experiment.     That  is,  the  time-energy  distri- 
bution (TED)  vector  appeared  to  be  the  most  effective  pre- 
dictor of  talker's  identity,  at  least  for  the  vectors  in- 
vestigated in  this  research.     It  also  is  demonstrated  that 
the  voiced/voiceless  speech  time   (WL)   vector  was  ranked 
second  to  TED  in  speaker  identification  capabilities,  whereas 
the remaining two vectors (vowel/consonant duration ratio
and word and phrase durations) showed little speaker identi-
fication power.

The  high  levels  of  identification  found  for  this  normal 
condition  might  have  been  expected  on  the  basis  of  the  results 
of  the  initial  experiment.     In  that  regard,  previous  re- 
searchers have  shown  that  "TED-like"  information  is  a  reason- 
able predictor  of  talker's  identity  (Majewski  and  Hollien, 
1974;  Doherty  and  Hollien,  1978,  and  others) .     For  example, 
Pruzansky  (1963)  reported  that  a  time-energy  distribution 
was  effective  as  a  speaker  identification  cue.     Therefore,  it 
may  be  assumed  that  a  vector  such  as  TED  contains  idiosyncratic 
speech characteristics which would permit identification of a
speaker  from  his  voice  alone.     Also,  the  higher  identification 
scores,  in  relation  to  the  other  vectors,  provided  by  the 
TED  vector  may  be  due,  at  least  in  part,  to  the  large  number 
of  parameters   (40)   available  for  the  identification  process. 
Thus,  a  great  deal  of  speaker-dependent  material  apparently 
is  utilized  in  this  classification  and  identification 
technique.

The  WL  vector  also  demonstrated  a  modest  level  of 
effectiveness.     This  finding  also  was  shown  in  the  first 
experiment.     In  addition,  previous  investigation  has  shown 
vocal  activity  to  play  a  role  in  the  speaker  identification 
process (see, for example, LaRiviere, 1975; Atal, 1972;
Wolf, 1972; Sambur, 1975; Doherty, 1976; and Doherty and
Hollien, 1978). Therefore, it would appear that the WL
vector  measures  certain  invariant  speaker  identification 
features  and  this  vector  may  be  a  viable  tool  in  a  speaker 
identification  system. 

Stress  Speaking 
Condition 

This  experimental  condition  attempts  to  discover  if 
stressful  speaking  situations  reduce  the  speaker  identifica- 
tion capabilities  of  the  selected  temporal  parameters.  The 
results  of  this  experiment  demonstrated  that  the  type  of 
stress  utilized  in  this  investigation  had  the  effect  of 
lowering  the  level  of  identification.     For  example,  the  TED 
vector  yielded  classification  scores  reduced  by  25  per  cent 
(95  per  cent  vs.  70  per  cent)   and  identification  scores  reduced 
by  20  per  cent   (60  per  cent  vs.  40  per  cent).     These  two 
relationships  suggest  that  the  features  being  measured  by 
the  TED  vector  are  altered  or  varied  when  a  speaker  is  placed 
in a stressful situation. However, it should be remembered
that the stress speech samples are being compared to the nor-
mal speech samples. Thus, the samples were non-contemporary.
The non-contemporariness of speech samples has been demonstrated
to be very deleterious to speaker identification (for example,
see McGehee, 1937; Rothman, 1977; and Tosi et al., 1972).
Therefore, the non-contemporariness of the speech samples must be
considered as a contributing factor to the reduced identifi-
cation rates.

The  jackknifed  classification  pretest  scores  obtained 
utilizing the WL vector demonstrated trends similar to TED.
That  is,   the  level  of  correct  speaker  classification  was 
reduced  comparing  the  normal  and  stress  conditions.  However, 
the  WL  vector  yielded  improved  scores  under  the  stress 
speaking  condition.     A  possible  explanation  for  these  rather 
confusing  findings  may  be  found  in  examining  the  physiological 
mechanisms which are affected under certain stressful con-
ditions.
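
The changes between the normal and stress conditions can be tabulated directly from the single-vector rows of Tables 5 and 6. The following is a small Python sketch offered for illustration only. The scores are copied from those tables; the dictionary layout and function name are assumptions introduced here, and the differences are expressed in percentage points.

```python
# Single-vector scores (per cent correct) copied from Tables 5 and 6.
NORMAL = {
    "TED": {"jackknifed": 95.0, "identification": 60.0},
    "WL": {"jackknifed": 40.0, "identification": 7.5},
}
STRESS = {
    "TED": {"jackknifed": 70.0, "identification": 40.0},
    "WL": {"jackknifed": 65.0, "identification": 20.0},
}


def score_change(normal, stress):
    """Stress score minus normal score, per vector and per task."""
    return {
        vector: {task: stress[vector][task] - normal[vector][task]
                 for task in normal[vector]}
        for vector in normal
    }


changes = score_change(NORMAL, STRESS)
for vector, by_task in changes.items():
    print(vector, by_task)
```

With the values as printed in the tables, the TED scores fall by 25 and 20 points, while the tabled WL scores rise by 25 and 12.5 points.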

When  the  combination  vectors  were  evaluated,   little  or 
no  improvement  in  the  classification  or  identification  scores 
was  demonstrated.     As  a  possible  explanation  for  this  find- 
ing,  it  would  appear  that  not  enough  new  information  was 
being  added  to  the  vector  combinations.     In  the  first 
experiment,   it  was  found  that  the  addition  of  more  variables 
improved,   to  varying  degrees,  the  overall  performance  of 
the  vectors.     It  is  possible  in  this  second  experiment  that 
there may not be a large enough population to express the
small  changes  which  may  be  occurring. 

Previous  research  on  the  relationship  between  stressful 
speaking  conditions  and  speaker  identification  has    been  very 
limited.    However,  a  few  studies  have  been  completed  which 
demonstrated  that  stress  speaking  conditions  do  have  some 
degrading  effects  upon  the  speaker  identification  processes. 
For  example,  Hollien  and  Majewski  (1977)  found  that,  under 
stress, their speaker identification rates utilizing long-
term speech spectra were reduced by 8 per cent to 20 per
cent. Further, in a similar LTS study, Doherty and Hollien
(1978)  report  reductions  in  their  identification  system  for 
the  stress  condition.  These  reductions  were  in  the  order  of 
5  per  cent  to  20  per  cent  contrasted  to  the  normal  condition. 
To  date  no  investigators  have  examined  the  same  parameters 
as  this  research  under  stressed  speaking  conditions.  How- 
ever, the  findings  of  this  experiment  are  in  general  agree- 
ment with  those  limited  studies  completed.     Therefore,  it 
seems  certain  that  stress  reduces  the  reliability  of  some 
speaker  identification  systems.     These  findings  and  those 
of  other  investigators  demonstrate  that  any  speaker  identifi- 
cation system  must  be  tested  with  respect  to  stressful 
speaking  conditions.     Such  testing  of  systems  helps 
predict the robustness of these procedures in real-world
situations.

Disguised  Speaking 
Condition 

The  results  of  this  speaking  condition  demonstrate 
that,  while  voice  disguise  does  interfere  with  and  lower 
classification  scores,  some  scores  were  recorded  which 
were  substantially  above  chance.     This  finding  demon- 
strates that  the  speaker  identification  features  measured 
by  the  TED  vector  and  to  a  lesser  extent  the  WL  vector, 
are  still  present  and  effective  even  when  the  speaker 
attempts  to  disguise  his  voice.     Therefore,  these  two 
vectors are functional in a speaker identification sys-
tem. Furthermore, a comparison of the results of this disguised
speaking condition with those of other investigators shows
that the TED X WL combination yielded the highest set of scores
(Hollien et al., 1974; Hollien and McGlone, 1976; Houlihan,
1977; Hollien and Majewski, 1977; Doherty and Hollien, 1978;
and Reich et al., 1976). Therefore, it would appear that
the  speaker  is  not  altering  the  features  being  measured  by 
the  TED  and  WL  vectors  to  the  same  degree  as  other  such 
features — speaking  fundamental  frequency  and  long-term 
speech spectra.
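
The phrase "above chance" can be given a rough numerical reference. The sketch below assumes a closed set of 20 equally likely speakers, so that random guessing would be correct 5 per cent of the time; this assumption and the helper name are introduced here for illustration and were not stated in the original analysis. The scores are the disguise-condition identification values from Table 7.

```python
def chance_level(n_speakers):
    """Per cent correct expected from random guessing in a closed set."""
    return 100.0 / n_speakers


# Identification scores (per cent correct) from Table 7.
DISGUISE_IDENTIFICATION = {"TED": 30.0, "WL": 5.0, "TED X WL": 40.0}

chance = chance_level(20)
ratios = {vector: score / chance
          for vector, score in DISGUISE_IDENTIFICATION.items()}
for vector, ratio in ratios.items():
    print(vector, "is", ratio, "times the chance level of", chance, "per cent")
```

Under this assumption the TED and TED X WL identification scores are several times the chance level, whereas the WL identification score taken alone is at the chance level.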

Further investigation of the combination vector,
TED X WL, shows an increase of 15 per cent compared to the
single  vectors.     This  finding  is  in  general  agreement  with 
those  of  the  initial  experiment.     Therefore,  the  explana- 
tions for  these  results  may  be  similar.     That  is,  the  TED 
and  WL  vectors  are,  for  the  most  part,  sampling  different 
types  of  speaker-dependent  information.     This  explanation 
is reasonable since it appears that the TED vector is measur-
ing characteristics which may be considered a temporal
counterpart to spectral information and WL parameters
are a direct result of vocal activity. Thus, these two
parameter groups should measure different speech characteris-
tics.

Parameter  Selection 
Techniques 

The  experimental  selection  of  parameters  for  this 
second  investigation  is  the  same  as  that  utilized  in  the 
initial  experiment.     As  stated  previously,  a  statistical 
procedure  was  utilized  in  order  to  maximize  the  level  of 
correct  speaker  identification  and  to  reduce  the  number  of 
parameters  needed  to  make  the  identifications.     In  this 
case,  a  stepwise  statistical  technique  of  parameter  in- 
clusion was  carried  out  during  the  discriminant  analysis. 
A determination of predictiveness for each parameter and/or
group  of  parameters  was  conducted  utilizing  an  F-statistic 
with  a  computation  formula  that  may  be  found  in  Forsythe 
et al. (1973). In accordance with this system of parameter
selection,  only  the  most  predictive  measurements  were 
entered  into  the  discrimination  process. 

As  was  seen  in  the  first  experiment,  only  a  small 
number  of  the  total  available  variables  were  utilized  in 
the  actual  discrimination  process.     Of  the  40  TED  parameters, 
19  produced  sufficiently  high  F-values  to  contribute  to  the 
identification process. As a point of interest, the parameters
selected in the laboratory-distorted speech experiment were not the
same as those chosen in the laboratory-normal experiment.
This  finding  indicates  that  the  set  of  parameters  varies 
with  the  circumstances  of  the  particular  population  and/or 
recording  conditions.     Therefore,  all  variables  should  be 
investigated  in  order  to  determine  which  set  will  prove 
the  most  efficient  in  the  classification  or  identification 
procedure.

Examination of the vector combinations demonstrates
that the addition of new variables may in some cases result
in  the  exclusion  of  others.     For  example,  when  the  WL 
vector  was  combined  with  the  TED  vector,  four  parameters 
formerly included in the single TED vector were excluded
from  the  combination  vector  of  TED  X  WL.     In  this  particu- 
lar case  the  overall  classification  and  identification 
scores  remain  relatively  unchanged.     Therefore,  it  can  be 
concluded  that  total  voiced  speech  time  and  total  articula- 
tion time  contained  a  small  amount  of  redundant  information. 
However,  since  only  four  parameters  were  removed  it  also 
would  appear  that  not  very  much  TED  information  is  repli- 
cated by  the  WL  vector. 

Test  Sample  Selection 

The classification procedures utilized in this experiment
(posterior, jackknifed, and identification) demonstrate
the importance of test sample selection. As was seen in
the first experiment, posterior classification of the speech
samples often resulted in the highest set of scores. An
explanation of this finding may be found in the method of
test sample selection and in the number of test samples.
In this method, all samples were used in the computation of
the reference set. Also, all four samples were utilized,
each in turn, as test samples, thus affording the best
possible method of speaker selection.

The jackknifed method of speaker selection resulted
in slightly lower classification scores; this finding is in
agreement with the first experiment. That is, in both the first
and second phases of this research project, the jackknifed
procedure demonstrated lower levels of speaker identification
than did the posterior classifications. The consistency
of these pretesting methods lies in the fact that
they are not dependent upon the type of information utilized.
Rather, classification scores are affected primarily by
the method by which the test sample was selected.
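The difference between the posterior and jackknifed procedures can be illustrated with a small sketch. It assumes a nearest-centroid classifier as a simple stand-in for the discriminant analysis actually used in this study; the function names and the measurements are hypothetical. In the posterior case each test sample also contributes to its own speaker's reference, whereas in the jackknifed case the test sample is withheld before the reference is computed.

```python
def centroid(vectors):
    """Mean vector of a list of equal-length sample vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(sample, references):
    """Return the speaker whose reference centroid is nearest."""
    return min(references, key=lambda s: sq_dist(sample, references[s]))


def score(data, jackknife):
    """Fraction of samples assigned to their own speaker.

    jackknife=False: the test sample stays in its speaker's reference.
    jackknife=True: the test sample is withheld from the reference
    (each speaker therefore needs at least two samples).
    """
    correct = total = 0
    for speaker, samples in data.items():
        for i, test in enumerate(samples):
            refs = {s: centroid(v) for s, v in data.items()}
            if jackknife:
                refs[speaker] = centroid(samples[:i] + samples[i + 1:])
            correct += classify(test, refs) == speaker
            total += 1
    return correct / total


# Made-up one-parameter measurements for two talkers.
data = {"A": [[0.0], [4.0]], "B": [[5.0], [9.0]]}
posterior_score = score(data, jackknife=False)
jackknifed_score = score(data, jackknife=True)
```

With these made-up values the posterior score is 1.0 and the jackknifed score is 0.5, regardless of which parameter is measured. The drop arises solely from withholding the test sample, which is the point made in the paragraph above.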

As was expected, the identification task yielded the
lowest set of scores. As stated in an earlier section, this
procedure utilized only the first of four speech samples as
the test. Therefore, speaker selection for the identification
task was based entirely on the information measured
in the first sample. In order to test the reliability of
this identification task, a second identification task was
carried out (this was also done in experiment one). In
this case the fourth speech sample was utilized as the test
sample. The results of this task may be seen in Table 8.
Comparing the first and second identification techniques
indicates that the speaker predictiveness is not
consistent through all the speech samples. That is, the
first sample is not necessarily the most idiosyncratic of
a given talker's speech. There also are indications that
the vector effectiveness varies from sample to sample.
Specifically, the TED vector demonstrated some decrease in
identification level while the WL vector showed an increase.
The findings appear to indicate that the selection of a
test sample may be critical for certain speaker-dependent
measures. In addition, it should be noted that the more
difficult the identification task becomes, the lower the
identification scores become.

Table 8.  Identification Scores for the Laboratory-Distorted
Speech Experiment Obtained Utilizing the Fourth
Speech Sample as the Test; Only the TED, WL, and
TED X WL Vectors Were Examined; N = 20; All Scores
Are in Percentages.

                     Identifications
Vectors        Normal     Stress     Disguise

TED             60.0       15.0        25.0
WL              10.0       35.0        10.0
TED X WL        65.0       25.0        25.0

In examining the findings of this distorted speech
experiment, the following conclusions may be drawn.

1.  Under all three speaking conditions (normal,
    stress, and disguise), the time-energy distribution
    (TED) vector and, to a lesser degree,
    the voiced/voiceless speech time (WL) vector
    are effective as speaker identification cues.

    A.  The vowel/consonant duration ratio (V/C)
        and the word and phrase duration (WPD)
        vectors did not appear to function
        reasonably as predictors of a talker's
        identity.

    B.  The findings of this experiment are in
        general agreement with those of the first
        experiment and those of previous
        researchers.

2.  Stressful and disguised speaking conditions appear
    to have deleterious effects on the speaker
    identification abilities of the selected temporal
    parameters.

    A.  The degrading effects reduced the levels
        of identification by varying degrees,
        depending on the sample selection and
        vector utilized in the test.

    B.  The temporal parameters utilized in this
        experiment seem to be able to identify a
        disguised voice with a higher degree of
        accuracy than any previous set of vectors
        known to this author.


CHAPTER  V 


THE RESULTS AND DISCUSSION OF THE
SEMI-FIELD EXPERIMENT

The final experiment of this research program constituted
an attempt to test the speaker identification capabilities
of the specified temporal parameters in a situation
that would parallel the conditions found in
the forensic model. As cited previously, a speaker (the
unknown) simulated a "crime" over the telephone; the known
exemplars were simulated by an interrogation procedure
carried out later. In other words, an unknown caller was
recorded over the telephone and the suspect pool was recorded
during a reading session which was set up to simulate
an interrogation. The time-energy distribution (TED) and
the voiced/voiceless speech time (WL) vectors were applied
to these recordings for use in the identification process.
The vowel/consonant duration ratio (V/C) and the word and
phrase duration (WPD) vectors were not utilized in this
experiment. These vectors yielded few, if any, correct
speaker identifications in either of the two previous
experiments; therefore, it was concluded that they
are ineffectual as speaker identification cues. Moreover,
the examples recorded were limited in word content and did
not contain words and phrases that were repeated an
adequate number of times. It should be noted that the
unknown speaker also was recorded as one of the known
exemplars; such a procedure permitted the identifications
to be made on a closed set.

Results 

The  simulated  field  situation  presented  some  unique 
problems.     Some  of  these  restrictive  conditions  were 
important  in  determining  the  discrimination  or  identification 
procedure  that  could  be  applied  to  the  data.     For  example, 
while  the  "unknown"  call  lasted  over  two  minutes,  the 
recordings  of  the  "suspects"  seldom  were  longer  than  60 
seconds  in  duration.     Therefore,  the  "suspect"  recordings 
could  not  be  utilized  to  develop  the  needed  reference  sets. 
This  problem  was  circumvented  by  using  the  unknown  call  as 
the reference set. That is, the tape recording of the
criminal call was divided into four 30-second samples and
these samples were used to generate the reference. In
turn, the recordings made by the suspects were used as
tests and only one 30-second sample per talker was
necessary for the process. Another problem existed because
there was only one correct speaker selection and the
identification was a binary decision. Thus, it could be either
right or wrong: 100 per cent or 0 per cent correct. Therefore,
a ranking system was utilized that would demonstrate
the vocal similarities between any of the known speakers
and the unknown speaker.

The  results  of  this  ranking  method  may  be  found  in 
Table  9.     As  can  be  seen  from  the  table,  the  TED  vector 
ranked  suspect  No.  11  as  the  one  most  similar  to  the  unknown 
from  among  the  twelve  possible  suspects.     On  the  other  hand, 
the  correct  suspect  was  ranked  sixth.     Applying  the  WL 
vector  to  these  data  yielded  only  slightly  better  results. 
Again, suspect No. 11 was judged as most like the unknown;
the correct choice, suspect No. 9, was ranked fifth.
Combining  the  two  vectors  yielded  results  identical  to 
those  of  the  TED  vector  when  used  in  isolation. 

A second method of speaker selection also was investigated.
In this subsequent set of tests, a cross-correlation
procedure was utilized as the statistical technique
rather  than  discriminant  analysis.     The  algorithm  used 
to  generate  the  correlations  was  the  same  as  that  employed 
by Zalewski et al. (1975). The parameters selected for
inclusion in the correlation procedure were chosen on the
basis of the previous experiments. Of the 40 TED parameters,
only a limited number (17) were entered into the statistical
technique. In the case of the WL vector, both total
articulation time and total voiced speech time were included
in the correlation process.

Table 9.  Identification Ranking for the Semi-Field Experiment
Obtained Utilizing Discriminant Analysis on
the Time-Energy Distribution (TED) and the Voiced/
Voiceless Speech Time (WL) Vectors, N = 12.

               Suspect                    Ranking
               Identified as   Correct    of Correct
Vectors        the Unknown     Choice     Choice

TED                11             9           6
WL                 11             9           5
TED X WL           11             9           6

The results of this cross-correlation technique are
enumerated in Table 10. Again, the relative speaker
identification effectiveness of the vector is represented
as a function of its suspect rankings. In this case, TED
chose suspect No. 6 as the unknown; the correlation
coefficient was .9628. However, the correct choice,
suspect No. 9, was ranked second with a correlation
coefficient of .9611. The WL vector selected the
correct suspect as being most similar to the unknown
speaker. The correlation coefficient in this case was
1.0000. Rankings based on the TED X WL vector combination
resulted in suspect No. 11 being chosen as the unknown;
suspect No. 9 (correct choice) was ranked fourth.
The correlation coefficients for this vector grouping
were .9391 and .8675, respectively.

Table 10.  Identification Rankings for the Semi-Field Experiment
Obtained Utilizing Cross-Correlations
on the TED and WL Vectors; the Correlation
Coefficients Between the Known and Unknown
Speakers Are Also Listed, N = 12.

               Suspect                       Ranking
               Identified as   Correct       of Correct
Vectors        the Unknown     Choice        Choice

TED             6 (.9628)      9 (.9611)         2
WL              9 (1.000)      9 (1.000)         1
TED X WL       11 (.9391)      9 (.8675)         4

Discussion

This  third  and  final  experiment  provided  the  greatest 
challenge  to  the  speaker  identification  capabilities  of 
the  selected  temporal  parameters.     In  the  first  two  sets 
of  experiments,  discriminant  analysis  was  utilized  to  test 
the  power  of  these  vectors  as  cues  for  speaker  identification. 
Hence,  this  approach  also  was  employed  in  this,   the  third, 
experiment.     However,  the  vectors  provided  only  very  low 
levels  of  correct  identification.     A  possible  explanation 
for  these  low  scores  may  be  found,  at  least  in  part,  in  the 
method of parameter selection. Generally, in discriminant
analysis the number of parameters included in the
discrimination process may not exceed one-half the number of
reference samples. In this particular case, only the four
speech samples of the unknown speaker were utilized to
generate the reference set. Therefore, only two parameters
could be included from among the vectors. It would appear
from the findings that the inclusion of only a few parameters
did not provide enough speaker-dependent information to
match the correct known speaker to the unknown.

In an attempt to force the inclusion of more variables,
a cross-correlation procedure was applied to the
data. This method was selected because it permits the
application of any number of parameters to the statistical
process. As was seen in Table 10, the second method of
speaker selection resulted in improved levels of identification.
Possibly this finding is a direct result of the
addition of more speech characteristics to the identification
process, an explanation which seems plausible because
the cross-correlation method permits more speech
parameters to be utilized in the identification process
(2 vs. 17 parameters). It also should be noted that the
correlation coefficients produced using the TED vector for
the two highest-ranked suspects were only .0017 apart.
Therefore, these two suspects were both about equally
correlated to the unknown speaker. This finding demonstrates
that some ambiguity does exist between the suspects' and
the unknown speaker's speech patterns.
These results may also indicate that cross-correlations are
not sensitive to some of the more subtle talker variations.
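The ranking of suspects by correlation can be sketched as follows. This is an illustration only, assuming an ordinary Pearson product-moment coefficient between the unknown's parameter vector and each suspect's vector; the exact algorithm of Zalewski et al. (1975) is not reproduced here. The function names, suspect labels, and values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for a constant vector")
    return sxy / sqrt(sxx * syy)


def rank_suspects(unknown, suspects):
    """Return [(suspect_id, r), ...] from most to least similar."""
    scored = [(sid, pearson(unknown, vec)) for sid, vec in suspects.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up parameter vectors (not data from this study).
unknown = [1.0, 2.0, 3.0, 4.0]
suspects = {
    "s1": [1.0, 2.0, 3.0, 5.0],
    "s2": [2.0, 4.0, 6.0, 8.0],
    "s3": [4.0, 3.0, 2.0, 1.0],
}
ranking = rank_suspects(unknown, suspects)
```

Here suspect "s2" ranks first because its vector is a scaled copy of the unknown's. Under this plain Pearson definition, a coefficient computed over only two values is always +1 or -1 whenever it is defined, so a two-parameter vector cannot separate suspects by the size of the coefficient. Whether this applies to the WL result depends on how the published algorithm was computed, which is not specified in this section.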

It  is  difficult  to  compare  the  results  obtained  for 
this  semi-field  experiment  with  those  provided  by  other 
research.     Few  studies  have  been  carried  out  investigating 
speaker  identification  within  the  forensic  model.  However, 
one  such  study  has  been  reported  by  Johnson,  Hollien,  and 
Doherty (1977). These authors, using this same data base,
employed long-term speech spectra as a speaker identification
cue. The data provided by this study demonstrated very
low levels of speaker identification. Johnson et al.
concluded  that  most  of  the  degradation  produced  was  due 
to  the  poor  recording  conditions  utilized  in  the  gathering 
of  the  speech  samples.     Therefore,  it  may  be  assumed  that 
these  same  poor  recording  conditions  contributed,  at  least  in 
part,  to  the  low  levels  of  identification  reported  in  the 
present  experiment. 

To summarize the results of this simulated field experiment,
the following statements may be made.

1.  The  TED  and  WL  vectors  demonstrated  relevance 
to  the  speaker  identification  task.     The  power 
levels  of  these  vectors  were  not  high  but  some 
relevance  even  under  these  simulated  field 
situations  was  found. 

2.  When  very  small  numbers  of  samples  are  available, 
discriminant  analysis  may  not  yield  the  highest 
possible  levels  of  speaker  identification. 

3.  It  appears  that  when  very  few  controls  are 
employed  in  gathering  the  data  base,  the  levels 
of  speaker  identification  may  be  reduced. 


CHAPTER  VI 


SUMMARY  AND  CONCLUSION 

The basic purpose of this research project was to
seek potential invariant characteristics within the temporal
elements of an individual's speech which permit
him to be identified from his voice alone. Generally,
these temporal measurements included duration analysis of
(1) relative energy at several levels of intensity (TED),
(2) voiced and voiceless activity (WL), (3) vowel/consonant
ratios (V/C), and (4) specific words and phrases (WPD).
In more specific terms, the TED vector was based on a
group (40) of time-energy measurements; this analysis
reflected the total accumulated time a talker remains at
a specific energy level. The WL vector was made up of
two parameters which represented the total duration of
voiced and voiceless activity during a speech sample. The
third vector, V/C, was composed of the ratios of the
duration of selected vowels to the duration of their
consonantal environment. For the purposes of this vector,
four specific words were utilized in the formation of the
V/C ratios. Finally, the WPD vector consisted of the
overall duration of several words and phrases. In this
case four words and three phrases were chosen for use in
the WPD vector.
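The notion of accumulated time at an energy level, on which the TED vector rests, can be sketched as follows. This is a minimal illustration under stated assumptions: the speech signal is taken to be already reduced to a sequence of frame energy levels in dB, and each measurement is taken to be the total time spent within one energy band. The frame duration, the band edges, and the function name are hypothetical and are not the settings used to obtain the 40 TED measurements of this study.

```python
def time_energy_distribution(frame_levels_db, frame_sec, band_edges_db):
    """Accumulate the time spent in each energy band.

    frame_levels_db: relative energy of consecutive frames, in dB
    frame_sec:       duration of one frame, in seconds
    band_edges_db:   ascending edges; band i covers
                     [edges[i], edges[i + 1])
    Frames outside every band are ignored.
    """
    totals = [0.0] * (len(band_edges_db) - 1)
    for level in frame_levels_db:
        for i in range(len(totals)):
            if band_edges_db[i] <= level < band_edges_db[i + 1]:
                totals[i] += frame_sec
                break
    return totals


# Made-up frame levels and four hypothetical 10 dB bands.
levels = [-35.0, -12.0, -8.0, -3.0, -25.0, -9.0]
edges = [-40.0, -30.0, -20.0, -10.0, 0.0]
ted = time_energy_distribution(levels, frame_sec=0.25, band_edges_db=edges)
```

Each element of the returned list is one temporal parameter: the number of seconds the talker spent in that band. The WL parameters could be accumulated in the same way from a per-frame voiced/voiceless decision.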

The selected temporal vectors were investigated under
a variety of speaking conditions. Three experiments were
utilized to study these conditions: (1) laboratory-normal,
(2) laboratory-distorted speech, and (3) semi-field.

In the initial experiment, the subjects (40 adult
males) read speech material while being recorded in an
"ideal" laboratory environment. The aim of this experiment
was to develop baseline data on the selected temporal
parameters. The findings in this case resulted in correct
identification scores ranging from 100 per cent to 25 per
cent for the TED vector. The WL vector yielded levels of
identification in the 57 per cent to 25 per cent range.
The remaining two vectors demonstrated very low
identification power. The vector combinations yielded improved
speaker identification scores. Specifically, the
identification tasks showed the greatest increases. The
TED X WL X V/C vector combination identified 19 of 40
speakers. This represents about a 50 per cent increase
over any of the vectors when used in isolation.

The purpose of the laboratory-distorted speech experiment
was to examine the effects of speech distortion
on the robustness of the temporal vectors. The speakers
(20 adult males) were recorded under the same laboratory
conditions as in the first experiment. However, in this case,
three speaking conditions were imposed upon the speakers:
normal, stress, and disguise. When the TED parameters were
applied to the normal data, 100 per cent to 40 per cent of
the subjects were correctly identified. Application of the
WL vector resulted in correct identification scores of
from 65 per cent to 7.5 per cent. The remaining two vectors
(V/C and WPD) resulted in no correct identification at
all. In the various vector combinations, the correct
identification rates were 100 per cent to 15 per cent. The
levels of identification obtained in the stress and disguise
conditions were substantially reduced. However,
the same trends which were exhibited in the normal speaking
condition were found in both the stress and disguise
data. In addition, the TED X WL vector combination
correctly classified 40 per cent of the disguised voices
to their normal counterparts.

The third and final experiment was included in this
research program in order to test the speaker identification
capabilities of the selected temporal parameters under
conditions which would parallel more closely those found
in the forensic model. In this case, a speaker simulated
a "crime" over the telephone; later, a suspect pool was
created by recordings made at a simulated interrogation.
The TED vector under this situation was unable to identify
the unknown caller from the suspect pool. On the other
hand, the WL vector did correctly match the known suspect
with the caller. The V/C and WPD vectors were not utilized
in this experiment because of some restrictive recording
conditions.

It  appears  from  the  findings  of  these  three  experiments 
that  the  questions  stated  at  the  onset  of  this  project  have 
been  answered,  at  least  within  the  constraints  of  this 
study.     It  is  apparent  that  temporal  speech  characteristics 
play  a  role  in  the  speaker  identification  process.  Moreover, 
the  selected  temporal  parameters  are  able  to  identify  a 
speaker  (recorded  under  normal  conditions)  with  a  fair 
degree  of  accuracy.     Therefore,   it  must  be  concluded  that 
these  vectors  do  measure  some  invariant,  idiosyncratic 
characteristics  of  an  individual's  speech  repertoire. 

In addition, these experiments demonstrated that the
levels of speaker identification are reduced when a speaker
is placed in a stressful situation or when he attempts to
disguise his voice. Thus, under these conditions, the
talker alters (voluntarily or involuntarily) certain temporal
elements of the speech waveform. An important result
was produced in the laboratory-distorted speech experiment.
That is, the TED X WL vector combination resulted in the
highest identification rate of any study known to the author.
Therefore, it must be concluded that the temporal vectors
measured in this study are less affected by vocal disguise
than certain frequency elements of speech previously
investigated.

Finally, the results of the third experiment demonstrated
the effects of simulated forensic conditions upon
the  selected  temporal  parameters.     From  these  findings 
it  must  be  concluded  that  poor  recording  conditions  and 
sampling  techniques  degrade  the  speaker  identification 
powers  of  the  selected  temporal  parameters. 

In conclusion, this research program appears to have
succeeded in demonstrating the role of temporal measurement
(in various situations) in the speaker identification
process. However, it also has shown that these temporal
vectors are inadequate when utilized as the sole
identification cues in a speaker identification system.
Therefore, further investigations into temporal measurement
should examine the possible incorporation of these vectors
into speaker identification systems which utilize frequency
parameters.


REFERENCES 


Atal, B. S. Automatic Speaker Recognition Based on Pitch
Contours. J. Acoust. Soc. Amer., 52, 1687-1697, 1972.

Black, J. W., Lashbrook, W., Nash, W., Oyer, H. J., Pedrey,
C., Tosi, O. I., and Truby, H. Reply to Speaker
Identification by Speech Spectrograms: Some Further
Observations. J. Acoust. Soc. Amer., 54, 535-537, 1973.

Bolt, R. H., Cooper, F. S., David, E. C., Denes, P. B.,
Pickett, J. M., and Stevens, K. N. Speaker Identification
by Speech Spectrograms. J. Acoust. Soc. Amer.,
47, 597-613, 1970.

Bolt, R. H., Cooper, F. S., David, E. C., Denes, P. B.,
Pickett, J. M., and Stevens, K. N. Speaker Identification
by Speech Spectrograms: Some Further Observations.
J. Acoust. Soc. Amer., 54, 531-534, 1973.

Bricker, P. and Pruzansky, S. Effects of Stimulus Content
and Duration on Talker Identification. J. Acoust. Soc.
Amer., 40, 1441-1450, 1966.

Clarke, F. R. and Becker, R. W. Comparison of Techniques for
Discriminating Among Talkers. J. Speech Hearing Res.,
12, 747-761, 1969.

Coleman, R. O. Speaker Identification in the Absence of
Intersubject Differences in Glottal Source Characteristics.
J. Acoust. Soc. Amer., 53, 1741-1743, 1973.

Compton, A. J. Effects of Filtering and Vocal Duration Upon
the Identification of Speakers Aurally. J. Acoust. Soc.
Amer., 35, 1748-1752, 1963.

Doherty, E. T. An Evaluation of Selected Acoustic Parameters
for Use in Speaker Identification. J. Phonetics, 4,
321-326, 1976.


Doherty, E. T. and Hollien, H. Multiple-Factor Speaker
Identification of Normal and Distorted Speech.
J. Phonetics, 6, 1-8, 1978.

Forsythe, A. B., Engelman, L., Jennrich, R., and May,
P. R. A. A Stopping Rule for Variable Selection in
Multiple Regression. J. Amer. Stat. Assoc., 68,
75-77, 1973.

Glenn, J. W. and Kliener, N. Speaker Identification Based on
Nasal Phonations. J. Acoust. Soc. Amer., 43, 368-372,
1968.

Goldstein, U. G. Speaker-Identifying Features Based on
Formant Tracks. J. Acoust. Soc. Amer., 59, 176-182,
1976.

Grey, C. and Kopp, G. Voiceprint Identification. Report
presented to the Bell Telephone Laboratory, Inc.,
1-14, 1944.

Hazen, B. M. Effects of Differing Phonetic Context on
Spectrographic Speaker Identification. J. Acoust. Soc.
Amer., 54, 650-660, 1973.

Hollien, H. Peculiar Case of "Voiceprints." J. Acoust. Soc.
Amer., 56, 210-213, 1974.

Hollien, H. Status Report of "Voiceprint" Identification
in the United States. Proceedings, International
Conference on Crime Countermeasures, Science and
Engineering. Oxford, England, July 25-29, 1977.

Hollien, H. and Majewski, W. Speaker Identification by Long-
Term Speech Spectra under Normal and Distorted Speech
Conditions. J. Acoust. Soc. Amer., 62, 975-980, 1977.

Hollien, H., Majewski, W., and Hollien, P. Perceptual
Identification of Voice under Normal, Stress and
Disguise Speaking Conditions. J. Acoust. Soc. Amer.,
56, 553, 1974.

Hollien, H. and McGlone, R. E. The Effects of Disguise on
"Voiceprint" Identification. Nat. J. Crim. Def., 2,
117-130, 1976.


Houlihan, K. The Effects of Disguise on Speaker
Identification from Sound Spectrograms. Proceedings, IPS-77,
Miami Beach, Florida, December 17-19, 1977 (in press).

Iles, M. Speaker Identification as a Function of Fundamental
Frequency and Resonant Frequencies. Ph.D. Dissertation,
University of Florida, 1972.

Johnson, C. C., Hollien, H., and Doherty, E. T. Long-term
Power Spectra and Formant Tracks as Speaker
Identification Cues in Simulated Forensic Situations.
Occasionally, 2, 41-43, 1977.

Kersta, L. G. Voiceprint Identification. Nature, 196,
1253-1257, 1962.

LaRiviere, C. L. Speaker Identification from Turbulent
Portions of Fricatives. Phonetica, 29, 246-252, 1974.

LaRiviere, C. L. Contributions of Fundamental Frequency
and Formant Frequencies to Speaker Identification.
J. Phonetics, 31, 185-197, 1975.

Lehiste, I. Reading in Acoustic Phonetics. The MIT Press,
Cambridge, Mass., 358p., 1967.

Majewski, W. and Hollien, H. Euclidean Distances Between
Long-term Speech Spectra as a Criterion for Speaker
Identification. Proceedings, Speech Communication
Seminar - 74. Stockholm, Sweden, 202-210, 1974.

McGehee, F. The Reliability of the Identification of the
Human Voice. J. Gen. Psychol., 17, 246-271, 1937.

McGlone, R. E., Hollien, P., and Hollien, H. Acoustic
Analysis of Voice Disguise Related to Voice
Identification. Proceedings, International Conference on Crime
Countermeasures, Science and Engineering. Oxford,
England, July 25-29, 1977.

Pollack, I., Pickett, J. M., and Sumby, W. H. On the
Identification of Speakers by Voice. J. Acoust. Soc. Amer.,
26, 403-412, 1954.


Potter, R. K., Kopp, G., and Kopp, H. G. Visible Speech.
Dover Press, New York, N. Y., 439p., 1966.

Pruzansky, S. Pattern Matching Procedure for Automatic
Talker Recognition. J. Acoust. Soc. Amer., 35,
354-358, 1963.

Reich, A. R., Moll, K. L., and Curtis, J. F. Effects of
Selected Vocal Disguises upon Spectrographic Speaker
Identification. J. Acoust. Soc. Amer., 60, 919-925,
1976.

Rothman, H. B. A Perceptual (Aural) and Spectrographic
Identification of Talkers with Similar Sounding Voices.
Proceedings, International Carnahan Conference on Crime
Countermeasures, Oxford, England, 1977.

Sambur, M. R. Selection of Acoustic Features for Speaker
Identification. IEEE Transactions on Acoustics,
Speech, and Signal Processing, ASSP-23, 169-176, 1975.

Stevens, K. N., Williams, C. E., Carbonnell, J. R., and
Woods, D. Speaker Identification and Authentication:
A Comparison of Spectrographic and Auditory Presentation
of Speech Materials. J. Acoust. Soc. Amer., 44,
1596-1607, 1968.

Tosi, O., Oyer, H., Lashbrook, W., Pedrey, C., and Nash, W.
Experiment on Voice Identification. J. Acoust. Soc.
Amer., 51, 2030-2043, 1972.

Wolf, J. J. Efficient Acoustic Parameters for Speaker
Recognition. J. Acoust. Soc. Amer., 51, 2044-2056,
1972.

Young, M. A. and Campbell, R. A. Effects of Context on
Talker Identification. J. Acoust. Soc. Amer., 42,
1238-1254, 1967.

Zalewski, J., Majewski, W., and Hollien, H. Cross-Correlation
Between Long-term Speech Spectra as a Criterion
for Speaker Identification. Acoustica, 34, 20-24,
1975.


BIOGRAPHICAL  SKETCH 


Charles  Clifford  Johnson,  Jr.  was  born  on  September  17, 
1948,  in  Glen  Cove,  New  York.     He  graduated  from  Flushing 
High  School  in  1967.     He  received  an  Associate  of  Arts 
degree  from  Manhattan  Community  College  in  June  1969. 
From  September  1969  to  June  1971,  Mr.  Johnson  attended 
City  College  of  New  York  and  received  a  Bachelor  of  Science 
degree  in  Biological  Oceanography  in  1971.     During  the 
years  from  1971  to  1973,  he  pursued  and  received  a  Master 
of  Science  degree  in  Marine  Science  from  C.  W.  Post  Center, 
Long  Island  University.     In  June  1973,  he  entered  a 
doctorate  program  at  the  University  of  Florida,  and  has 
since  pursued  work  for  a  Doctor  of  Philosophy  degree  at 
the Institute for Advanced Study of the Communication
Processes.

On  June  3,  1972,  Mr.  Johnson  was  married  to  Christine 
Olga  Ginal.     They  now  have  two  children,  Charles  Clifford  III 
and  Cristen  Elizabeth. 




I  certify  that  I  have  read  this  study  and  that  in  my 
opinion  it  conforms  to  acceptable  standards  of  scholarly 
presentation  and  is  fully  adequate,   in  scope  and  quality, 
as  a  dissertation  for  the  degree  of  Doctor  of  Philosophy. 


Harry  Hollien,  Chairman 
Professor  of  Linguistics 


I  certify  that  I  have  read  this  study  and  that  in  my 
opinion  it  conforms  to  acceptable  standards  of  scholarly 
presentation  and  is  fully  adequate,   in  scope  and  quality, 
as  a  dissertation  for  the  degree  of  Doctor  of  Philosophy. 

Howard B. Rothman
Associate Professor of Speech


I  certify  that  I  have  read  this  study  and  that  in  my 
opinion  it  conforms  to  acceptable  standards  of  scholarly 
presentation and is fully adequate, in scope and quality,
as a dissertation for the degree of Doctor of Philosophy.


Professor  of  Psychology 


I  certify  that  I  have  read  this  study  and  that  in  my 
opinion  it  conforms  to  acceptable  standards  of  scholarly 
presentation  and  is  fully  adequate,   in  scope  and  quality, 
as  a  dissertation  for  the  degree  of  Doctor  of  Philosophy. 


Alan  Agresti 
Associate  Professor  of  Statistics 


I  certify  that  I  have  read  this  study  and  that  in  my 
opinion  it  conforms  to  acceptable  standards  of  scholarly 
presentation  and  is  fully  adequate,  in  scope  and  quality, 
as a dissertation for the degree of Doctor of Philosophy.

W. S. Brown
Associate Professor of Speech


This dissertation was submitted to the Graduate Faculty of
the Department of Speech in the College of Arts and Sciences
and to the Graduate Council, and was accepted as partial
fulfillment of the requirements for the degree of Doctor
of Philosophy.

August, 1978


Dean,  Graduate  School 


